diff mbox

[3/3] ovl: redirect on rename-dir

Message ID 1477380887-21333-4-git-send-email-mszeredi@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Miklos Szeredi Oct. 25, 2016, 7:34 a.m. UTC
Current code returns EXDEV when a directory would need to be copied up to
move.  We could copy up the directory tree in this case, but there's
another solution: point to old lower directory from moved upper directory.

This is achieved with a "trusted.overlay.redirect" xattr storing the path
relative to the root of the overlay.  After such attribute has been set,
the directory can be moved without further actions required.

This is a backward incompatible feature, old kernels won't be able to
correctly mount an overlay containing redirected directories.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 Documentation/filesystems/overlayfs.txt | 21 ++++++-
 fs/overlayfs/copy_up.c                  | 20 ++-----
 fs/overlayfs/dir.c                      | 86 +++++++++++++++++++---------
 fs/overlayfs/namei.c                    | 99 ++++++++++++++++++++++++++++++---
 fs/overlayfs/overlayfs.h                |  4 ++
 fs/overlayfs/ovl_entry.h                |  4 ++
 fs/overlayfs/super.c                    | 25 +++++----
 fs/overlayfs/util.c                     | 19 +++++++
 8 files changed, 217 insertions(+), 61 deletions(-)

Comments

Raphael Hertzog Oct. 25, 2016, 11:57 a.m. UTC | #1
Hello Miklos,

thanks for your work on this patch set!

On Tue, 25 Oct 2016, Miklos Szeredi wrote:
> +renaming directories
> +--------------------
> +
> +When renaming a directory that is on the lower layer or merged (i.e. the
> +directory was not created on the upper layer to start with) overlayfs can
> +handle it in two different ways:
> +
> +1) return EXDEV error: this error is returned by rename(2) when trying to
> +   move a file or directory across filesystem boundaries.  Hence
> +   applications are usually prepared to hande this error (mv(1) for example
> +   recursively copies the directory tree).  This is the default behavior.
> +
> +2) If the "redirect_dir" feature is enabled, then the directory will be
> +   copied up (but not the contents).  Then the "trusted.overlay.redirect"
> +   extended attribute is set to the path of the original location from the
> +   root of the overlay.  Finally the directory is moved to the new
> +   location.

From this, I understand that we will have to add "redirect_dir" to the
mount options for this feature to work. Is there any reason why you
put this as an opt-in feature?

Do you plan to make it the default in the future when it has been
available for a while?

Barring any regression introduced by your patch, it seems that the feature
is best available by default since it allows legitimate operations to
succeed that are otherwise refused. I understand that it makes it
impossible to mount the overlay filesystem with an older kernel but is
that problem more widespread than the one we're fixing here?  On my side,
overlayfs is only used in scenarios where the kernel is always the same
(or newer compared to what created the initial filesystem).

Cheers,
Amir Goldstein Oct. 25, 2016, 12:49 p.m. UTC | #2
On Tue, Oct 25, 2016 at 10:34 AM, Miklos Szeredi <mszeredi@redhat.com> wrote:
> Current code returns EXDEV when a directory would need to be copied up to
> move.  We could copy up the directory tree in this case, but there's
> another solution: point to old lower directory from moved upper directory.
>
> This is achieved with a "trusted.overlay.redirect" xattr storing the path
> relative to the root of the overlay.  After such attribute has been set,
> the directory can be moved without further actions required.
>

Nice! I played with a similar, but simpler idea last month:
https://github.com/amir73il/linux/commit/a3907e732984fb4fd340544ca27c45cac8be8060
it only handle renames within same parent.
Down the road, you may want to consider differentiating this simpler case
of "name redirect", as it incurs less path lookups. An entry can always be
promoted from "name redirect" to "full path redirect"

> This is a backward incompatible feature, old kernels won't be able to
> correctly mount an overlay containing redirected directories.
>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>  Documentation/filesystems/overlayfs.txt | 21 ++++++-
>  fs/overlayfs/copy_up.c                  | 20 ++-----
>  fs/overlayfs/dir.c                      | 86 +++++++++++++++++++---------
>  fs/overlayfs/namei.c                    | 99 ++++++++++++++++++++++++++++++---
>  fs/overlayfs/overlayfs.h                |  4 ++
>  fs/overlayfs/ovl_entry.h                |  4 ++
>  fs/overlayfs/super.c                    | 25 +++++----
>  fs/overlayfs/util.c                     | 19 +++++++
>  8 files changed, 217 insertions(+), 61 deletions(-)
>
> diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt
> index 5108425157ac..fae48ab3b36b 100644
> --- a/Documentation/filesystems/overlayfs.txt
> +++ b/Documentation/filesystems/overlayfs.txt
> @@ -130,6 +130,23 @@ directory.
>  Readdir on directories that are not merged is simply handled by the
>  underlying directory (upper or lower).
>
> +renaming directories
> +--------------------
> +
> +When renaming a directory that is on the lower layer or merged (i.e. the
> +directory was not created on the upper layer to start with) overlayfs can
> +handle it in two different ways:
> +
> +1) return EXDEV error: this error is returned by rename(2) when trying to
> +   move a file or directory across filesystem boundaries.  Hence
> +   applications are usually prepared to hande this error (mv(1) for example
> +   recursively copies the directory tree).  This is the default behavior.
> +
> +2) If the "redirect_dir" feature is enabled, then the directory will be
> +   copied up (but not the contents).  Then the "trusted.overlay.redirect"
> +   extended attribute is set to the path of the original location from the
> +   root of the overlay.  Finally the directory is moved to the new
> +   location.
>
>  Non-directories
>  ---------------
> @@ -201,8 +218,8 @@ If a file with multiple hard links is copied up, then this will
>  "break" the link.  Changes will not be propagated to other names
>  referring to the same inode.
>
> -Directory trees are not copied up.  If rename(2) is performed on a directory
> -which is on the lower layer or is merged, then -EXDEV will be returned.
> +Unless "redirect_dir" feature is enabled, rename(2) on a lower or merged
> +directory will fail with EXDEV.
>
>  Changes to underlying filesystems
>  ---------------------------------
> diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> index 49f6158bb04c..0d7075208099 100644
> --- a/fs/overlayfs/copy_up.c
> +++ b/fs/overlayfs/copy_up.c
> @@ -323,17 +323,11 @@ static int ovl_copy_up_locked(struct dentry *workdir, struct dentry *upperdir,
>  /*
>   * Copy up a single dentry
>   *
> - * Directory renames only allowed on "pure upper" (already created on
> - * upper filesystem, never copied up).  Directories which are on lower or
> - * are merged may not be renamed.  For these -EXDEV is returned and
> - * userspace has to deal with it.  This means, when copying up a
> - * directory we can rely on it and ancestors being stable.
> - *
> - * Non-directory renames start with copy up of source if necessary.  The
> - * actual rename will only proceed once the copy up was successful.  Copy
> - * up uses upper parent i_mutex for exclusion.  Since rename can change
> - * d_parent it is possible that the copy up will lock the old parent.  At
> - * that point the file will have already been copied up anyway.
> + * All renames start with copy up of source if necessary.  The actual
> + * rename will only proceed once the copy up was successful.  Copy up uses
> + * upper parent i_mutex for exclusion.  Since rename can change d_parent it
> + * is possible that the copy up will lock the old parent.  At that point
> + * the file will have already been copied up anyway.
>   */
>  int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
>                     struct path *lowerpath, struct kstat *stat)
> @@ -345,7 +339,6 @@ int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
>         struct path parentpath;
>         struct dentry *lowerdentry = lowerpath->dentry;
>         struct dentry *upperdir;
> -       struct dentry *upperdentry;
>         const char *link = NULL;
>
>         if (WARN_ON(!workdir))
> @@ -371,8 +364,7 @@ int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
>                 pr_err("overlayfs: failed to lock workdir+upperdir\n");
>                 goto out_unlock;
>         }
> -       upperdentry = ovl_dentry_upper(dentry);
> -       if (upperdentry) {
> +       if (ovl_dentry_upper(dentry)) {
>                 /* Raced with another copy-up?  Nothing to do, then... */
>                 err = 0;
>                 goto out_unlock;
> diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
> index 2c1057d747cb..065e0211f9b6 100644
> --- a/fs/overlayfs/dir.c
> +++ b/fs/overlayfs/dir.c
> @@ -757,6 +757,40 @@ static bool ovl_type_merge_or_lower(struct dentry *dentry)
>         return OVL_TYPE_MERGE(type) || !OVL_TYPE_UPPER(type);
>  }
>
> +static bool ovl_can_move(struct dentry *dentry)
> +{
> +       return ovl_redirect_dir(dentry->d_sb) ||
> +               !d_is_dir(dentry) || !ovl_type_merge_or_lower(dentry);
> +}
> +
> +static int ovl_set_redirect(struct dentry *dentry)
> +{
> +       char *buf;
> +       char *path;
> +       int err;
> +
> +       if (ovl_dentry_is_redirect(dentry))
> +               return 0;
> +
> +       buf = __getname();
> +       if (!buf)
> +               return -ENOMEM;
> +
> +       path = dentry_path_raw(dentry, buf, PATH_MAX);
> +       err = PTR_ERR(path);
> +       if (IS_ERR(path))
> +               goto putname;
> +
> +       err = ovl_do_setxattr(ovl_dentry_upper(dentry), OVL_XATTR_REDIRECT,
> +                             path, strlen(path), 0);
> +putname:
> +       __putname(buf);
> +       if (!err)
> +               ovl_dentry_set_redirect(dentry);
> +
> +       return err;
> +}
> +
>  static int ovl_rename(struct inode *olddir, struct dentry *old,
>                       struct inode *newdir, struct dentry *new,
>                       unsigned int flags)
> @@ -784,9 +818,9 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
>
>         /* Don't copy up directory trees */
>         err = -EXDEV;
> -       if (is_dir && ovl_type_merge_or_lower(old))
> +       if (!ovl_can_move(old))
>                 goto out;
> -       if (!overwrite && new_is_dir && ovl_type_merge_or_lower(new))
> +       if (!overwrite && !ovl_can_move(new))
>                 goto out;
>
>         err = ovl_want_write(old);
> @@ -837,7 +871,6 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
>
>         trap = lock_rename(new_upperdir, old_upperdir);
>
> -
>         olddentry = lookup_one_len(old->d_name.name, old_upperdir,
>                                    old->d_name.len);
>         err = PTR_ERR(olddentry);
> @@ -880,31 +913,34 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
>         if (WARN_ON(olddentry->d_inode == newdentry->d_inode))
>                 goto out_dput;
>
> -       if (is_dir && !old_opaque && ovl_lower_positive(new)) {
> -               err = ovl_set_opaque(olddentry);
> -               if (err)
> -                       goto out_dput;
> -               ovl_dentry_set_opaque(old, true);
> +       if (is_dir) {
> +               if (ovl_type_merge_or_lower(old)) {
> +                       err = ovl_set_redirect(old);

There is a fair chance of getting ENOSPC/EDQUOT here and confuse user space.
Would it be better to convert these non fatal errors with EXDEV, so
user space will
gracefully fallback to recursive rename/clone/copy?

> +                       if (err)
> +                               goto out_dput;
> +               } else if (!old_opaque && ovl_lower_positive(new)) {
> +                       err = ovl_set_opaque(olddentry);
> +                       if (err)
> +                               goto out_dput;
> +                       ovl_dentry_set_opaque(old, true);
> +               }
>         }
> -       if (!overwrite &&
> -           new_is_dir && !new_opaque && ovl_lower_positive(old)) {
> -               err = ovl_set_opaque(newdentry);
> -               if (err)
> -                       goto out_dput;
> -               ovl_dentry_set_opaque(new, true);
> +       if (!overwrite && new_is_dir) {
> +               if (ovl_type_merge_or_lower(new)) {
> +                       err = ovl_set_redirect(new);
> +                       if (err)
> +                               goto out_dput;
> +               } else if (!new_opaque && ovl_lower_positive(old)) {
> +                       err = ovl_set_opaque(newdentry);
> +                       if (err)
> +                               goto out_dput;
> +                       ovl_dentry_set_opaque(new, true);
> +               }
>         }
>
> -       if (old_opaque || new_opaque) {
> -               err = ovl_do_rename(old_upperdir->d_inode, olddentry,
> -                                   new_upperdir->d_inode, newdentry,
> -                                   flags);
> -       } else {
> -               /* No debug for the plain case */
> -               BUG_ON(flags & ~RENAME_EXCHANGE);
> -               err = vfs_rename(old_upperdir->d_inode, olddentry,
> -                                new_upperdir->d_inode, newdentry,
> -                                NULL, flags);
> -       }
> +       err = ovl_do_rename(old_upperdir->d_inode, olddentry,
> +                           new_upperdir->d_inode, newdentry,
> +                           flags);
>         if (err)
>                 goto out_dput;
>
> diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
> index f4057fcb0246..c7cacbb8ce09 100644
> --- a/fs/overlayfs/namei.c
> +++ b/fs/overlayfs/namei.c
> @@ -8,6 +8,7 @@
>   */
>
>  #include <linux/fs.h>
> +#include <linux/mount.h>
>  #include <linux/namei.h>
>  #include <linux/xattr.h>
>  #include "overlayfs.h"
> @@ -48,6 +49,28 @@ static bool ovl_is_opaquedir(struct dentry *dentry)
>         return false;
>  }
>
> +static int ovl_check_redirect(struct dentry *dentry, char **bufp)
> +{
> +       int res;
> +
> +       res = vfs_getxattr(dentry, OVL_XATTR_REDIRECT, NULL, 0);
> +       if (res > 0) {
> +               char *buf = kzalloc(res + 1, GFP_KERNEL);
> +
> +               if (!buf)
> +                       return -ENOMEM;
> +
> +               res = vfs_getxattr(dentry, OVL_XATTR_REDIRECT, buf, res);
> +               if (res < 0)
> +                       return res;
> +
> +               kfree(*bufp);
> +               *bufp = buf;
> +       }
> +
> +       return 0;
> +}
> +
>  /*
>   * Returns next layer in stack starting from top.
>   * Returns -1 if this is the last layer.
> @@ -80,11 +103,13 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>         unsigned int ctr = 0;
>         struct inode *inode = NULL;
>         bool upperopaque = false;
> +       bool upperredirect = false;
>         bool stop = false;
>         bool isdir = false;
>         struct dentry *this;
>         unsigned int i;
>         int err;
> +       char *redirect = NULL;
>
>         old_cred = ovl_override_creds(dentry->d_sb);
>         upperdir = ovl_upperdentry_dereference(poe);
> @@ -110,11 +135,23 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>                                 isdir = true;
>                                 if (poe->numlower && ovl_is_opaquedir(this))
>                                         stop = upperopaque = true;
> +                               else if (ovl_redirect_dir(dentry->d_sb)) {
> +                                       err = ovl_check_redirect(this,
> +                                                                &redirect);
> +                                       if (err)
> +                                               goto out;
> +
> +                                       if (redirect)
> +                                               upperredirect = true;
> +                               }
>                         }
>                 }
>                 upperdentry = this;
>         }
>
> +       if (redirect)
> +               poe = dentry->d_sb->s_root->d_fsdata;
> +
>         if (!stop && poe->numlower) {
>                 err = -ENOMEM;
>                 stack = kcalloc(poe->numlower, sizeof(struct path), GFP_KERNEL);
> @@ -125,15 +162,39 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>         for (i = 0; !stop && i < poe->numlower; i++) {
>                 struct path lowerpath = poe->lowerstack[i];
>
> -               this = ovl_lookup_real(lowerpath.dentry, &dentry->d_name);
> -               err = PTR_ERR(this);
> -               if (IS_ERR(this)) {
> -                       /*
> -                        * If it's positive, then treat ENAMETOOLONG as ENOENT.
> -                        */
> -                       if (err == -ENAMETOOLONG && (upperdentry || ctr))
> -                               continue;
> -                       goto out_put;
> +               if (!redirect) {
> +                       this = ovl_lookup_real(lowerpath.dentry, &dentry->d_name);
> +                       err = PTR_ERR(this);
> +                       if (IS_ERR(this)) {
> +                               /*
> +                                * If it's positive, then treat ENAMETOOLONG as ENOENT.
> +                                */
> +                               if (err == -ENAMETOOLONG && (upperdentry || ctr))
> +                                       continue;
> +                               goto out_put;
> +                       }
> +               } else {
> +                       struct path thispath;
> +
> +                       err = vfs_path_lookup(lowerpath.dentry, lowerpath.mnt,
> +                                             redirect, 0, &thispath);
> +
> +                       if (err) {
> +                               if (err == -ENOENT || err == -ENAMETOOLONG)
> +                                       this = NULL;
> +                       } else {
> +                               this = thispath.dentry;
> +                               mntput(thispath.mnt);
> +                               if (!this->d_inode) {
> +                                       dput(this);
> +                                       this = NULL;
> +                               } else if (ovl_dentry_weird(this)) {
> +                                       dput(this);
> +                                       err = -EREMOTE;
> +                               }
> +                       }
> +                       if (err)
> +                               goto out_put;
>                 }
>                 if (!this)
>                         continue;
> @@ -162,6 +223,23 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>                 stack[ctr].dentry = this;
>                 stack[ctr].mnt = lowerpath.mnt;
>                 ctr++;
> +
> +               if (!stop && i != poe->numlower - 1 &&
> +                   d_is_dir(this) && ovl_redirect_dir(dentry->d_sb)) {
> +                       err = ovl_check_redirect(this, &redirect);
> +                       if (err)
> +                               goto out_put;
> +
> +                       if (redirect && poe != dentry->d_sb->s_root->d_fsdata) {
> +                               poe = dentry->d_sb->s_root->d_fsdata;
> +

Now you are about to continue looping until new value of poe->numlower,
which is >= then olf value of poe->numlower, but 'stack' was allocated
according to old value of poe->numlower, so aren't you in danger of
overflowing it?

Please add a comment to explain the purpose of this loop rewind.


> +                               for (i = 0; i < poe->numlower; i++)
> +                                       if (poe->lowerstack[i].mnt == lowerpath.mnt)
> +                                               break;
> +                               if (WARN_ON(i == poe->numlower))
> +                                       break;
> +                       }
> +               }
>         }
>
>         oe = ovl_alloc_entry(ctr);
> @@ -192,9 +270,11 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>
>         revert_creds(old_cred);
>         oe->opaque = upperopaque;
> +       oe->redirect = upperredirect;
>         oe->__upperdentry = upperdentry;
>         memcpy(oe->lowerstack, stack, sizeof(struct path) * ctr);
>         kfree(stack);
> +       kfree(redirect);
>         dentry->d_fsdata = oe;
>         d_add(dentry, inode);
>
> @@ -209,6 +289,7 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>  out_put_upper:
>         dput(upperdentry);
>  out:
> +       kfree(redirect);
>         revert_creds(old_cred);
>         return ERR_PTR(err);
>  }
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index d61d5b9d0d91..9d80ce367ad8 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -20,6 +20,7 @@ enum ovl_path_type {
>  #define OVL_XATTR_PREFIX XATTR_TRUSTED_PREFIX "overlay."
>  #define OVL_XATTR_OPAQUE OVL_XATTR_PREFIX "opaque"
>  #define OVL_XATTR_FEATURES OVL_XATTR_PREFIX "features"
> +#define OVL_XATTR_REDIRECT OVL_XATTR_PREFIX "redirect"
>
>  #define OVL_ISUPPER_MASK 1UL
>
> @@ -157,6 +158,9 @@ void ovl_set_dir_cache(struct dentry *dentry, struct ovl_dir_cache *cache);
>  bool ovl_dentry_is_opaque(struct dentry *dentry);
>  bool ovl_dentry_is_whiteout(struct dentry *dentry);
>  void ovl_dentry_set_opaque(struct dentry *dentry, bool opaque);
> +bool ovl_redirect_dir(struct super_block *sb);
> +bool ovl_dentry_is_redirect(struct dentry *dentry);
> +void ovl_dentry_set_redirect(struct dentry *dentry);
>  void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry);
>  void ovl_inode_init(struct inode *inode, struct inode *realinode,
>                     bool is_upper);
> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> index 3b7ba59ad27e..2b22645535ff 100644
> --- a/fs/overlayfs/ovl_entry.h
> +++ b/fs/overlayfs/ovl_entry.h
> @@ -26,6 +26,9 @@ struct ovl_fs {
>         struct ovl_config config;
>         /* creds of process who forced instantiation of super block */
>         const struct cred *creator_cred;
> +
> +       /* Check for "redirect" on directories */
> +       bool redirect_dir;
>  };
>
>  /* private information held for every overlayfs dentry */
> @@ -36,6 +39,7 @@ struct ovl_entry {
>                 struct {
>                         u64 version;
>                         bool opaque;
> +                       bool redirect;
>                 };
>                 struct rcu_head rcu;
>         };
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index d6dc8d905d00..fc22a8a2e0d0 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -397,7 +397,7 @@ static struct dentry *ovl_workdir_create(struct vfsmount *mnt,
>         goto out_unlock;
>  }
>
> -static int ovl_check_features(struct dentry *root)
> +static int ovl_check_features(struct ovl_fs *ufs, struct dentry *root)
>  {
>         int res;
>         char *buf, *tmp, *p;
> @@ -421,8 +421,12 @@ static int ovl_check_features(struct dentry *root)
>         res = 0;
>         tmp = buf;
>         while ((p = strsep(&tmp, ",")) != NULL) {
> -               res = -EINVAL;
> -               pr_err("overlayfs: feature '%s' not supported\n", p);
> +               if (strcmp(p, "redirect_dir") == 0) {
> +                       ufs->redirect_dir = true;
> +               } else {
> +                       res = -EINVAL;
> +                       pr_err("overlayfs: feature '%s' not supported\n", p);
> +               }
>         }
>  out_free:
>         kfree(buf);
> @@ -494,8 +498,8 @@ static int ovl_mount_dir(const char *name, struct path *path)
>         return err;
>  }
>
> -static int ovl_lower_dir(const char *name, struct path *path, long *namelen,
> -                        int *stack_depth, bool *remote)
> +static int ovl_lower_dir(const char *name, struct path *path,
> +                        struct ovl_fs *ufs, int *stack_depth, bool *remote)
>  {
>         int err;
>         struct kstatfs statfs;
> @@ -504,7 +508,7 @@ static int ovl_lower_dir(const char *name, struct path *path, long *namelen,
>         if (err)
>                 goto out;
>
> -       err = ovl_check_features(path->dentry);
> +       err = ovl_check_features(ufs, path->dentry);
>         if (err)
>                 goto out_put;
>
> @@ -513,7 +517,7 @@ static int ovl_lower_dir(const char *name, struct path *path, long *namelen,
>                 pr_err("overlayfs: statfs failed on '%s'\n", name);
>                 goto out_put;
>         }
> -       *namelen = max(*namelen, statfs.f_namelen);
> +       ufs->lower_namelen = max(ufs->lower_namelen, statfs.f_namelen);
>         *stack_depth = max(*stack_depth, path->mnt->mnt_sb->s_stack_depth);
>
>         if (ovl_dentry_remote(path->dentry))
> @@ -730,7 +734,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>                         goto out_put_upperpath;
>                 }
>
> -               err = ovl_check_features(upperpath.dentry);
> +               err = ovl_check_features(ufs, upperpath.dentry);
>                 if (err)
>                         goto out_put_upperpath;
>
> @@ -771,9 +775,8 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
>
>         lower = lowertmp;
>         for (numlower = 0; numlower < stacklen; numlower++) {
> -               err = ovl_lower_dir(lower, &stack[numlower],
> -                                   &ufs->lower_namelen, &sb->s_stack_depth,
> -                                   &remote);
> +               err = ovl_lower_dir(lower, &stack[numlower], ufs,
> +                                   &sb->s_stack_depth, &remote);
>                 if (err)
>                         goto out_put_lowerpath;
>
> diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> index 0d45a84468d2..06dae4cabd53 100644
> --- a/fs/overlayfs/util.c
> +++ b/fs/overlayfs/util.c
> @@ -176,6 +176,25 @@ void ovl_dentry_set_opaque(struct dentry *dentry, bool opaque)
>         oe->opaque = opaque;
>  }
>
> +bool ovl_redirect_dir(struct super_block *sb)
> +{
> +       struct ovl_fs *ofs = sb->s_fs_info;
> +
> +       return ofs->redirect_dir;
> +}
> +
> +bool ovl_dentry_is_redirect(struct dentry *dentry)
> +{
> +       struct ovl_entry *oe = dentry->d_fsdata;
> +       return oe->redirect;
> +}
> +
> +void ovl_dentry_set_redirect(struct dentry *dentry)
> +{
> +       struct ovl_entry *oe = dentry->d_fsdata;
> +       oe->redirect = true;
> +}
> +
>  void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry)
>  {
>         struct ovl_entry *oe = dentry->d_fsdata;
> --
> 2.5.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-unionfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miklos Szeredi Oct. 26, 2016, 11:12 a.m. UTC | #3
On Tue, Oct 25, 2016 at 1:57 PM, Raphael Hertzog <hertzog@debian.org> wrote:

> Do you plan to make it the default in the future when it has been
> available for a while?
>
> Barring any regression introduced by your patch, it seems that the feature
> is best available by default since it allows legitimate operations to
> succeed that are otherwise refused. I understand that it makes it
> impossible to mount the overlay filesystem with an older kernel but is
> that problem more widespread than the one we're fixing here?  On my side,
> overlayfs is only used in scenarios where the kernel is always the same
> (or newer compared to what created the initial filesystem).

I think it would be safe to make it the default if upperdir is empty.
Nonempty implies that it was created with old kernel (or it was
crafted by hand).   But there should be a way to explicitly turn it
off;  either because of the need for backward compatibility or because
the old format is simply easier to work with for humans.

How about:

- If upper is nonempty, then leave redirect feature alone except when
mount option "-oredirect=on" is used to force enabling it.
- If upper is empty, then enable redirect feature except when mount
option "-oredirect=off" is used to force disabling it.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miklos Szeredi Oct. 26, 2016, 11:26 a.m. UTC | #4
On Tue, Oct 25, 2016 at 2:49 PM, Amir Goldstein <amir73il@gmail.com> wrote:

>> @@ -880,31 +913,34 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
>>         if (WARN_ON(olddentry->d_inode == newdentry->d_inode))
>>                 goto out_dput;
>>
>> -       if (is_dir && !old_opaque && ovl_lower_positive(new)) {
>> -               err = ovl_set_opaque(olddentry);
>> -               if (err)
>> -                       goto out_dput;
>> -               ovl_dentry_set_opaque(old, true);
>> +       if (is_dir) {
>> +               if (ovl_type_merge_or_lower(old)) {
>> +                       err = ovl_set_redirect(old);
>
> There is a fair chance of getting ENOSPC/EDQUOT here and confuse user space.
> Would it be better to convert these non fatal errors with EXDEV, so
> user space will
> gracefully fallback to recursive rename/clone/copy?

Recursive copy up will surely consume more space than an xattr?

>> @@ -162,6 +223,23 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>>                 stack[ctr].dentry = this;
>>                 stack[ctr].mnt = lowerpath.mnt;
>>                 ctr++;
>> +
>> +               if (!stop && i != poe->numlower - 1 &&
>> +                   d_is_dir(this) && ovl_redirect_dir(dentry->d_sb)) {
>> +                       err = ovl_check_redirect(this, &redirect);
>> +                       if (err)
>> +                               goto out_put;
>> +
>> +                       if (redirect && poe != dentry->d_sb->s_root->d_fsdata) {
>> +                               poe = dentry->d_sb->s_root->d_fsdata;
>> +
>
> Now you are about to continue looping until new value of poe->numlower,
> which is >= then olf value of poe->numlower, but 'stack' was allocated
> according to old value of poe->numlower, so aren't you in danger of
> overflowing it?
>
> Please add a comment to explain the purpose of this loop rewind.

We are jumping to a stack possibly wider than the current one and need
to find the layer where to continue the downward traversal.  I'll add
the comment.

BTW I don't remember having tested this, so it might possibly be
buggy.  Automatic multi-layer testing would really be good.  What we
basically need is:

 - create normal (two layer) overlay (with interesting constructs,
whiteout, opaque dir, redirect)
 - umount
 - create three layer overlay where the two lower layers come from the
previous upper/lower layers
 - do more interesting things

There's one such test in xfstests but it would be good to have more.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Oct. 26, 2016, 12:11 p.m. UTC | #5
On Wed, Oct 26, 2016 at 2:26 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Tue, Oct 25, 2016 at 2:49 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>
>>> @@ -880,31 +913,34 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
>>>         if (WARN_ON(olddentry->d_inode == newdentry->d_inode))
>>>                 goto out_dput;
>>>
>>> -       if (is_dir && !old_opaque && ovl_lower_positive(new)) {
>>> -               err = ovl_set_opaque(olddentry);
>>> -               if (err)
>>> -                       goto out_dput;
>>> -               ovl_dentry_set_opaque(old, true);
>>> +       if (is_dir) {
>>> +               if (ovl_type_merge_or_lower(old)) {
>>> +                       err = ovl_set_redirect(old);
>>
>> There is a fair chance of getting ENOSPC/EDQUOT here and confuse user space.
>> Would it be better to convert these non fatal errors with EXDEV, so
>> user space will
>> gracefully fallback to recursive rename/clone/copy?
>
> Recursive copy up will surely consume more space than an xattr?

Surely. But that is not what I meant.
In Ext4, for example, the total size of extended attributes for an
inode is limited to a single block,
so you can get ENOSPC with alot of free space in the file system.

You can also get ERANGE if the redirect path length > 256.
I'm just saying that instead of failing the move with ERANGE or
uncalled for ENOSPC,
it is better for user space to get EXDEV, from which there is a sane fallback.
BTW, mv (from linux-utils) falls back from -EXDEV to recursive move,
which is quite
efficient with clone up.

>
>>> @@ -162,6 +223,23 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>>>                 stack[ctr].dentry = this;
>>>                 stack[ctr].mnt = lowerpath.mnt;
>>>                 ctr++;
>>> +
>>> +               if (!stop && i != poe->numlower - 1 &&
>>> +                   d_is_dir(this) && ovl_redirect_dir(dentry->d_sb)) {
>>> +                       err = ovl_check_redirect(this, &redirect);
>>> +                       if (err)
>>> +                               goto out_put;
>>> +
>>> +                       if (redirect && poe != dentry->d_sb->s_root->d_fsdata) {
>>> +                               poe = dentry->d_sb->s_root->d_fsdata;
>>> +
>>
>> Now you are about to continue looping until new value of poe->numlower,
>> which is >= then olf value of poe->numlower, but 'stack' was allocated
>> according to old value of poe->numlower, so aren't you in danger of
>> overflowing it?
>>
>> Please add a comment to explain the purpose of this loop rewind.
>
> We are jumping to a stack possibly wider than the current one and need
> to find the layer where to continue the downward traversal.  I'll add
> the comment.
>
> BTW I don't remember having tested this, so it might possibly be
> buggy.  Automatic multi-layer testing would really be good.  What we
> basically need is:
>
>  - create normal (two layer) overlay (with interesting constructs,
> whiteout, opaque dir, redirect)
>  - umount
>  - create three layer overlay where the two lower layers come from the
> previous upper/lower layers
>  - do more interesting things
>
> There's one such test in xfstests but it would be good to have more.
>
> Thanks,
> Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miklos Szeredi Oct. 26, 2016, 12:51 p.m. UTC | #6
On Wed, Oct 26, 2016 at 2:11 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Wed, Oct 26, 2016 at 2:26 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Tue, Oct 25, 2016 at 2:49 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>
>>>> @@ -880,31 +913,34 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
>>>>         if (WARN_ON(olddentry->d_inode == newdentry->d_inode))
>>>>                 goto out_dput;
>>>>
>>>> -       if (is_dir && !old_opaque && ovl_lower_positive(new)) {
>>>> -               err = ovl_set_opaque(olddentry);
>>>> -               if (err)
>>>> -                       goto out_dput;
>>>> -               ovl_dentry_set_opaque(old, true);
>>>> +       if (is_dir) {
>>>> +               if (ovl_type_merge_or_lower(old)) {
>>>> +                       err = ovl_set_redirect(old);
>>>
>>> There is a fair chance of getting ENOSPC/EDQUOT here and confuse user space.
>>> Would it be better to convert these non fatal errors with EXDEV, so
>>> user space will
>>> gracefully fallback to recursive rename/clone/copy?
>>
>> Recursive copy up will surely consume more space than an xattr?
>
> Surely. But that is not what I meant.
> In Ext4, for example, the total size of extended attributes for an
> inode is limited to a single block,
> so you can get ENOSPC with alot of free space in the file system.

Ah.

>
> You can also get ERANGE if the redirect path length > 256.

The 256 limit is for the name of the xattr, not the value.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Oct. 26, 2016, 7:56 p.m. UTC | #7
On Wed, Oct 26, 2016 at 2:26 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Tue, Oct 25, 2016 at 2:49 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>
>>> @@ -880,31 +913,34 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
>>>         if (WARN_ON(olddentry->d_inode == newdentry->d_inode))
>>>                 goto out_dput;
>>>
>>> -       if (is_dir && !old_opaque && ovl_lower_positive(new)) {
>>> -               err = ovl_set_opaque(olddentry);
>>> -               if (err)
>>> -                       goto out_dput;
>>> -               ovl_dentry_set_opaque(old, true);
>>> +       if (is_dir) {
>>> +               if (ovl_type_merge_or_lower(old)) {
>>> +                       err = ovl_set_redirect(old);
>>
>> There is a fair chance of getting ENOSPC/EDQUOT here and confuse user space.
>> Would it be better to convert these non fatal errors with EXDEV, so
>> user space will
>> gracefully fallback to recursive rename/clone/copy?
>
> Recursive copy up will surely consume more space than an xattr?
>
>>> @@ -162,6 +223,23 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>>>                 stack[ctr].dentry = this;
>>>                 stack[ctr].mnt = lowerpath.mnt;
>>>                 ctr++;
>>> +
>>> +               if (!stop && i != poe->numlower - 1 &&
>>> +                   d_is_dir(this) && ovl_redirect_dir(dentry->d_sb)) {
>>> +                       err = ovl_check_redirect(this, &redirect);
>>> +                       if (err)
>>> +                               goto out_put;
>>> +
>>> +                       if (redirect && poe != dentry->d_sb->s_root->d_fsdata) {
>>> +                               poe = dentry->d_sb->s_root->d_fsdata;
>>> +
>>
>> Now you are about to continue looping until new value of poe->numlower,
>> which is >= then olf value of poe->numlower, but 'stack' was allocated
>> according to old value of poe->numlower, so aren't you in danger of
>> overflowing it?
>>
>> Please add a comment to explain the purpose of this loop rewind.
>
> We are jumping to a stack possibly wider than the current one and need
> to find the layer where to continue the downward traversal.  I'll add
> the comment.
>

OK. my point was that you need to allocate an sb max depth stack in advance,
in case you need to jump to a wider stack.

> BTW I don't remember having tested this, so it might possibly be
> buggy.  Automatic multi-layer testing would really be good.  What we
> basically need is:
>
>  - create normal (two layer) overlay (with interesting constructs,
> whiteout, opaque dir, redirect)
>  - umount
>  - create three layer overlay where the two lower layers come from the
> previous upper/lower layers
>  - do more interesting things
>
> There's one such test in xfstests but it would be good to have more.
>

I just sent 2 patches to fix 2 overlayfs xfstests failures.
With these 2 changes, the entire quick test group passes on my setup
(short of one test that also fails on ext4 and xfs).

Now I will start to think about instrumenting generic xfstests with lower/upper
files and then with rotating upper (i.e. layer stack).

Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Raphael Hertzog Oct. 28, 2016, 12:56 p.m. UTC | #8
Hi,

On Wed, 26 Oct 2016, Miklos Szeredi wrote:
> On Tue, Oct 25, 2016 at 1:57 PM, Raphael Hertzog <hertzog@debian.org> wrote:

[ redirect feature enabled by default ]

> I think it would be safe to make it the default if upperdir is empty.
> Nonempty implies that it was created with old kernel (or it was
> crafted by hand).   But there should be a way to explicitly turn it
> off;  either because of the need for backward compatibility or because
> the old format is simply easier to work with for humans.
> 
> How about:
> 
> - If upper is nonempty, then leave redirect feature alone except when
> mount option "-oredirect=on" is used to force enabling it.
> - If upper is empty, then enable redirect feature except when mount
> option "-oredirect=off" is used to force disabling it.

Looks good, but is there some stickyness of the setting? My use case is
schroot and it always starts with an empty upperdir when I create a
temporary/throw-away chroot but if I reboot the machine, then it's possible
that schroot will remount the overlayfs with a non-empty upperdir
(the temporary chroot is preserved until I explicitly end the chroot).

Cheers,
Miklos Szeredi Oct. 28, 2016, 12:59 p.m. UTC | #9
On Fri, Oct 28, 2016 at 2:56 PM, Raphael Hertzog <hertzog@debian.org> wrote:
> Hi,
>
> On Wed, 26 Oct 2016, Miklos Szeredi wrote:
>> On Tue, Oct 25, 2016 at 1:57 PM, Raphael Hertzog <hertzog@debian.org> wrote:
>
> [ redirect feature enabled by default ]
>
>> I think it would be safe to make it the default if upperdir is empty.
>> Nonempty implies that it was created with old kernel (or it was
>> crafted by hand).   But there should be a way to explicitly turn it
>> off;  either because of the need for backward compatibility or because
>> the old format is simply easier to work with for humans.
>>
>> How about:
>>
>> - If upper is nonempty, then leave redirect feature alone except when
>> mount option "-oredirect=on" is used to force enabling it.
>> - If upper is empty, then enable redirect feature except when mount
>> option "-oredirect=off" is used to force disabling it.
>
> Looks good, but is there some stickyness of the setting? My use case is
> schroot and it always starts with an empty upperdir when I create a
> temporary/throw-away chroot but if I reboot the machine, then it's possible
> that schroot will remount the overlayfs with a non-empty upperdir
> (the temporary chroot is preserved until I explicitly end the chroot).

Yes, there is stickiness (it stores the "redirect_dir" feature in an
extended attribute on upperdir).

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Oct. 28, 2016, 4:15 p.m. UTC | #10
On Tue, Oct 25, 2016 at 09:34:47AM +0200, Miklos Szeredi wrote:
> Current code returns EXDEV when a directory would need to be copied up to
> move.  We could copy up the directory tree in this case, but there's
> another solution: point to old lower directory from moved upper directory.
> 
> This is achieved with a "trusted.overlay.redirect" xattr storing the path
> relative to the root of the overlay.  After such attribute has been set,
> the directory can be moved without further actions required.
> 
> This is a backward incompatible feature, old kernels won't be able to
> correctly mount an overlay containing redirected directories.

> +			err = vfs_path_lookup(lowerpath.dentry, lowerpath.mnt,
> +					      redirect, 0, &thispath);
> +
> +			if (err) {
> +				if (err == -ENOENT || err == -ENAMETOOLONG)
> +					this = NULL;
> +			} else {
> +				this = thispath.dentry;
> +				mntput(thispath.mnt);
> +				if (!this->d_inode) {
> +					dput(this);
> +					this = NULL;
> +				} else if (ovl_dentry_weird(this)) {
> +					dput(this);
> +					err = -EREMOTE;
> +				}
> +			}

I'm not happy with that one - you are relying upon the fairly subtle
assertions here.
	1)  Had lowerpath.mnt *not* been a privately cloned one with nothing
mounted on it, you would've been screwed.
	2) Had that thing contained a "jumper" symlink (a-la procfs ones),
you would've been screwed.  Currently only procfs has those, and it would've
been rejected before getting there, but this is brittle and non-obvious.
	3) Any automount point in there (nfs4 referrals, etc.) can
break the assumption that nothing could've been mounted on it.  And _that_
might have not been stepped onto; back when the path had been stored, there'd
been no automount point at all, so we have avoided ovl_dentry_weird() rejects,
and by now nothing on the path had been visited yet, so ovl_dentry_weird()
didn't have a chance to trigger.  Note that calling it on the last dentry
is no good - we might have crossed the automount point in the middle of that
path, so this last dentry might be nice and shiny - and on another filesystem.
So unlike (1) and (2) it's not just a fishy-looking thing that happens to
work for non-local reasons; AFAICS, it's actually a bug.

I'm not sure if vfs_path_lookup() is the right tool here.  It might be
usable for making such a tool, but as it is you are setting one hell of
a trap for yourself...

It might be made to work, if we figure out the right semantics for disabling
symlinks on per-vfsmount basis (and no, the posted nolinks patches are not
it) and mark these private clones with that and with similar "disable
automount traversals" flag (again, needs the right semantics; the area is
convoluted as it is).  But in that case I would strongly recommend adding
an exported wrapper around vfs_path_lookup() that would verify that these
flags *are* set.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Oct. 30, 2016, 10 p.m. UTC | #11
On Wed, Oct 26, 2016 at 2:26 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>

...

>
> BTW I don't remember having tested this, so it might possibly be
> buggy.  Automatic multi-layer testing would really be good.  What we
> basically need is:
>
>  - create normal (two layer) overlay (with interesting constructs,
> whiteout, opaque dir, redirect)
>  - umount
>  - create three layer overlay where the two lower layers come from the
> previous upper/lower layers
>  - do more interesting things
>
> There's one such test in xfstests but it would be good to have more.
>

I have a POC for generic addition of layers for unionmount-testsuite
I pushed it to my github:
https://github.com/amir73il/unionmount-testsuite.git #ovl_rename_dir

remount(ctx) rotates upper to top of lower layers and adds a new upper layer
For now, remount(ctx) is called after successful mkdir and rename dir to
catch directory redirect bugs.

As it happens, it is very easy to find a bug, because it is enough to
just umount an mount the overlay without even adding a new layer
after directory rename to corrupt the overlayfs lookup.

Almost every rename-*-dir test demonstrates the problem

Going forward, I intend to hook the remount(ctx) event
in a cleaner way, so for every VFS op in every subtest,
it is possible to configure a test where the operation happens:
- without any remount
- with umount + mount
- with umount + mount with a new layer

Cheers,
Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miklos Szeredi Oct. 31, 2016, 2:59 p.m. UTC | #12
On Sun, Oct 30, 2016 at 11:00 PM, Amir Goldstein <amir73il@gmail.com> wrote:

> I have a POC for generic addition of layers for unionmount-testsuite
> I pushed it to my github:
> https://github.com/amir73il/unionmount-testsuite.git #ovl_rename_dir

Cool.  But... this seems to be missing the "remount_union" module.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Oct. 31, 2016, 3:02 p.m. UTC | #13
On Mon, Oct 31, 2016 at 4:59 PM, Miklos Szeredi <mszeredi@redhat.com> wrote:
> On Sun, Oct 30, 2016 at 11:00 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>
>> I have a POC for generic addition of layers for unionmount-testsuite
>> I pushed it to my github:
>> https://github.com/amir73il/unionmount-testsuite.git #ovl_rename_dir
>
> Cool.  But... this seems to be missing the "remount_union" module.

Doh! fixes and force pushed

>
> Thanks,
> Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miklos Szeredi Nov. 3, 2016, 3:50 p.m. UTC | #14
On Fri, Oct 28, 2016 at 6:15 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Oct 25, 2016 at 09:34:47AM +0200, Miklos Szeredi wrote:
>> Current code returns EXDEV when a directory would need to be copied up to
>> move.  We could copy up the directory tree in this case, but there's
>> another solution: point to old lower directory from moved upper directory.
>>
>> This is achieved with a "trusted.overlay.redirect" xattr storing the path
>> relative to the root of the overlay.  After such attribute has been set,
>> the directory can be moved without further actions required.
>>
>> This is a backward incompatible feature, old kernels won't be able to
>> correctly mount an overlay containing redirected directories.
>
>> +                     err = vfs_path_lookup(lowerpath.dentry, lowerpath.mnt,
>> +                                           redirect, 0, &thispath);
>> +
>> +                     if (err) {
>> +                             if (err == -ENOENT || err == -ENAMETOOLONG)
>> +                                     this = NULL;
>> +                     } else {
>> +                             this = thispath.dentry;
>> +                             mntput(thispath.mnt);
>> +                             if (!this->d_inode) {
>> +                                     dput(this);
>> +                                     this = NULL;
>> +                             } else if (ovl_dentry_weird(this)) {
>> +                                     dput(this);
>> +                                     err = -EREMOTE;
>> +                             }
>> +                     }
>
> I'm not happy with that one - you are relying upon the fairly subtle
> assertions here.
>         1)  Had lowerpath.mnt *not* been a privately cloned one with nothing
> mounted on it, you would've been screwed.
>         2) Had that thing contained a "jumper" symlink (a-la procfs ones),
> you would've been screwed.  Currently only procfs has those, and it would've
> been rejected before getting there, but this is brittle and non-obvious.
>         3) Any automount point in there (nfs4 referrals, etc.) can
> break the assumption that nothing could've been mounted on it.  And _that_
> might have not been stepped onto; back when the path had been stored, there'd
> been no automount point at all, so we have avoided ovl_dentry_weird() rejects,
> and by now nothing on the path had been visited yet, so ovl_dentry_weird()
> didn't have a chance to trigger.  Note that calling it on the last dentry
> is no good - we might have crossed the automount point in the middle of that
> path, so this last dentry might be nice and shiny - and on another filesystem.
> So unlike (1) and (2) it's not just a fishy-looking thing that happens to
> work for non-local reasons; AFAICS, it's actually a bug.
>
> I'm not sure if vfs_path_lookup() is the right tool here.  It might be
> usable for making such a tool, but as it is you are setting one hell of
> a trap for yourself...

Agreed, it's not the right tool.   A custom loop of lookup_one_len's
should work much better and doesn't add all that much complexity.
Updated patch pushed to:

git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git #redirect

This version also passes the recycling tests by Amir and enables the
redirect feature by default on an empty upperdir.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Konstantin Khlebnikov Nov. 6, 2016, 7:14 p.m. UTC | #15
On Wed, Oct 26, 2016 at 2:12 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Tue, Oct 25, 2016 at 1:57 PM, Raphael Hertzog <hertzog@debian.org> wrote:
>
>> Do you plan to make it the default in the future when it has been
>> available for a while?
>>
>> Barring any regression introduced by your patch, it seems that the feature
>> is best available by default since it allows legitimate operations to
>> succeed that are otherwise refused. I understand that it makes it
>> impossible to mount the overlay filesystem with an older kernel but is
>> that problem more widespread than the one we're fixing here?  On my side,
>> overlayfs is only used in scenarios where the kernel is always the same
>> (or newer compared to what created the initial filesystem).
>
> I think it would be safe to make it the default if upperdir is empty.
> Nonempty implies that it was created with old kernel (or it was
> crafted by hand).   But there should be a way to explicitly turn it
> off;  either because of the need for backward compatibility or because
> the old format is simply easier to work with for humans.
>
> How about:
>
> - If upper is nonempty, then leave redirect feature alone except when
> mount option "-oredirect=on" is used to force enabling it.
> - If upper is empty, then enable redirect feature except when mount
> option "-oredirect=off" is used to force disabling it.

I don't like this empty-nonempty upper logic.

I think this feature should be off by default and be enabled
explicitly in mount option.
Available features could be listed in sysfs /sys/fs/overlay/..., like ext4 does.

Overlayfs mounting anyway is complicated operation.
User must know a lot about it and provide persistent state for each mount:
list layers in correct order, work and uppder directory on the same disk, etc.
Enabled features is a part of this state.

Probably this could be solved in userspace tool "mount.overlay" - it could load
features and layers from config file or xattr and set required mount
options automatically.

--
Konstantin
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miklos Szeredi Nov. 7, 2016, 8:07 a.m. UTC | #16
On Sun, Nov 6, 2016 at 8:14 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Wed, Oct 26, 2016 at 2:12 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Tue, Oct 25, 2016 at 1:57 PM, Raphael Hertzog <hertzog@debian.org> wrote:
>>
>>> Do you plan to make it the default in the future when it has been
>>> available for a while?
>>>
>>> Barring any regression introduced by your patch, it seems that the feature
>>> is best available by default since it allows legitimate operations to
>>> succeed that are otherwise refused. I understand that it makes it
>>> impossible to mount the overlay filesystem with an older kernel but is
>>> that problem more widespread than the one we're fixing here?  On my side,
>>> overlayfs is only used in scenarios where the kernel is always the same
>>> (or newer compared to what created the initial filesystem).
>>
>> I think it would be safe to make it the default if upperdir is empty.
>> Nonempty implies that it was created with old kernel (or it was
>> crafted by hand).   But there should be a way to explicitly turn it
>> off;  either because of the need for backward compatibility or because
>> the old format is simply easier to work with for humans.
>>
>> How about:
>>
>> - If upper is nonempty, then leave redirect feature alone except when
>> mount option "-oredirect=on" is used to force enabling it.
>> - If upper is empty, then enable redirect feature except when mount
>> option "-oredirect=off" is used to force disabling it.
>
> I don't like this empty-nonempty upper logic.
>
> I think this feature should be off by default and be enabled
> explicitly in mount option.
> Available features could be listed in sysfs /sys/fs/overlay/..., like ext4 does.
>
> Overlayfs mounting anyway is complicated operation.
> User must know a lot about it and provide persistent state for each mount:
> list layers in correct order, work and uppder directory on the same disk, etc.
> Enabled features is a part of this state.
>
> Probably this could be solved in userspace tool "mount.overlay" - it could load
> features and layers from config file or xattr and set required mount
> options automatically.

It seems there are some conflicting opinions here.  I have mine too,
but I'll let this simmer and concentrate on actual features for now.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Konstantin Khlebnikov Nov. 7, 2016, 9:58 a.m. UTC | #17
On Mon, Nov 7, 2016 at 11:07 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Sun, Nov 6, 2016 at 8:14 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>> On Wed, Oct 26, 2016 at 2:12 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>> On Tue, Oct 25, 2016 at 1:57 PM, Raphael Hertzog <hertzog@debian.org> wrote:
>>>
>>>> Do you plan to make it the default in the future when it has been
>>>> available for a while?
>>>>
>>>> Barring any regression introduced by your patch, it seems that the feature
>>>> is best available by default since it allows legitimate operations to
>>>> succeed that are otherwise refused. I understand that it makes it
>>>> impossible to mount the overlay filesystem with an older kernel but is
>>>> that problem more widespread than the one we're fixing here?  On my side,
>>>> overlayfs is only used in scenarios where the kernel is always the same
>>>> (or newer compared to what created the initial filesystem).
>>>
>>> I think it would be safe to make it the default if upperdir is empty.
>>> Nonempty implies that it was created with old kernel (or it was
>>> crafted by hand).   But there should be a way to explicitly turn it
>>> off;  either because of the need for backward compatibility or because
>>> the old format is simply easier to work with for humans.
>>>
>>> How about:
>>>
>>> - If upper is nonempty, then leave redirect feature alone except when
>>> mount option "-oredirect=on" is used to force enabling it.
>>> - If upper is empty, then enable redirect feature except when mount
>>> option "-oredirect=off" is used to force disabling it.
>>
>> I don't like this empty-nonempty upper logic.
>>
>> I think this feature should be off by default and be enabled
>> explicitly in mount option.
>> Available features could be listed in sysfs /sys/fs/overlay/..., like ext4 does.
>>
>> Overlayfs mounting anyway is complicated operation.
>> User must know a lot about it and provide persistent state for each mount:
>> list layers in correct order, work and uppder directory on the same disk, etc.
>> Enabled features is a part of this state.
>>
>> Probably this could be solved in userspace tool "mount.overlay" - it could load
>> features and layers from config file or xattr and set required mount
>> options automatically.
>
> It seems there are some conflicting opinions here.  I have mine too,
> but I'll let this simmer and concentrate on actual features for now.

Ok =) let's try keep as simple as possible.

I've stumbled on somehow related problem - concurrent copy-ups are
strictly serialized by rename locks.
Obviously, file copying could be done in parallel: locks are required
only for final rename.
Because of that overlay slower that aufs for some workloads.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miklos Szeredi Nov. 7, 2016, 10:04 a.m. UTC | #18
On Mon, Nov 7, 2016 at 10:58 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:

> I've stumbled on somehow related problem - concurrent copy-ups are
> strictly serialized by rename locks.
> Obviously, file copying could be done in parallel: locks are required
> only for final rename.
> Because of that overlay slower that aufs for some workloads.

Easy to fix: for each copy up create a separate subdir of "work".
Then the contention is only for the time of creating the subdir, which
is very short.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Konstantin Khlebnikov Nov. 7, 2016, 10:08 a.m. UTC | #19
On Mon, Nov 7, 2016 at 1:04 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Mon, Nov 7, 2016 at 10:58 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>
>> I've stumbled on somehow related problem - concurrent copy-ups are
>> strictly serialized by rename locks.
>> Obviously, file copying could be done in parallel: locks are required
>> only for final rename.
>> Because of that overlay slower that aufs for some workloads.
>
> Easy to fix: for each copy up create a separate subdir of "work".
> Then the contention is only for the time of creating the subdir, which
> is very short.

Yeah, but lock_rename() also takes per-sb s_vfs_rename_mutex (kludge by Al Viro)
I think proper synchronization for concurrent copy-up (for example
round flag on ovl_entry) and  locking rename only for rename could be
better.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Raphael Hertzog Nov. 7, 2016, 11:03 a.m. UTC | #20
Hello,

On Sun, 06 Nov 2016, Konstantin Khlebnikov wrote:
> > - If upper is nonempty, then leave redirect feature alone except when
> > mount option "-oredirect=on" is used to force enabling it.
> > - If upper is empty, then enable redirect feature except when mount
> > option "-oredirect=off" is used to force disabling it.
> 
> I don't like this empty-nonempty upper logic.

Why? (I don't have the feeling that your subsequent paragraphs answer this
question... unless "overlayfs mounting is hard, let's complicate it even
more" is your answer)

> I think this feature should be off by default and be enabled
> explicitly in mount option.
> Available features could be listed in sysfs /sys/fs/overlay/..., like ext4 does.

TTBOMK in ext4, they are set at mkfs time and the default feature set
comes from /etc/mke2fs.conf. There's nothing like that for overlayfs.

> Overlayfs mounting anyway is complicated operation.
> User must know a lot about it and provide persistent state for each mount:
> list layers in correct order, work and uppder directory on the same disk, etc.
> Enabled features is a part of this state.

A large part of the users are not direct overlayfs users, they use
application (like schroot and live-build in my case) that rely on
overlayfs to offer some user-modifiable throw-away chroots or some
persistency on top of a read-only image. In both cases, the upper
directory start empty.

I find it highly disturbing to have to modify all those applications just
to get the correct semantics to rename a directory.

> Probably this could be solved in userspace tool "mount.overlay" - it could load
> features and layers from config file or xattr and set required mount
> options automatically.

I'm all for having a better API to mount overlayfs, but I don't think
blocking on the "redirect=on" by default is a good way to get this.

Cheers,
Konstantin Khlebnikov Nov. 7, 2016, 11:31 a.m. UTC | #21
On Mon, Nov 7, 2016 at 2:03 PM, Raphael Hertzog <hertzog@debian.org> wrote:
> Hello,
>
> On Sun, 06 Nov 2016, Konstantin Khlebnikov wrote:
>> > - If upper is nonempty, then leave redirect feature alone except when
>> > mount option "-oredirect=on" is used to force enabling it.
>> > - If upper is empty, then enable redirect feature except when mount
>> > option "-oredirect=off" is used to force disabling it.
>>
>> I don't like this empty-nonempty upper logic.
>
> Why? (I don't have the feeling that your subsequent paragraphs answer this
> question... unless "overlayfs mounting is hard, let's complicate it even
> more" is your answer)

Mixing flags from mount options, xattrs and emptiness of upper layer
doesn't make it simpler.

We have clear statement that options in /proc/mounts describes overlay
instance, let's keep feeding state in this way.

>
>> I think this feature should be off by default and be enabled
>> explicitly in mount option.
>> Available features could be listed in sysfs /sys/fs/overlay/..., like ext4 does.
>
> TTBOMK in ext4, they are set at mkfs time and the default feature set
> comes from /etc/mke2fs.conf. There's nothing like that for overlayfs.

/etc/mke2fs.conf is used by mkfs.ext4 - state is saved in super-block
inside filesystem.

overlayfs have no persistent super-block.

>
>> Overlayfs mounting anyway is complicated operation.
>> User must know a lot about it and provide persistent state for each mount:
>> list layers in correct order, work and uppder directory on the same disk, etc.
>> Enabled features is a part of this state.
>
> A large part of the users are not direct overlayfs users, they use
> application (like schroot and live-build in my case) that rely on
> overlayfs to offer some user-modifiable throw-away chroots or some
> persistency on top of a read-only image. In both cases, the upper
> directory start empty.
>
> I find it highly disturbing to have to modify all those applications just
> to get the correct semantics to rename a directory.

Other application are already aware about overlay layout and use it.
We cannot enable by default new backward incompatible features.

Returning -EXDEV is a completely correct semantics for rename,
most applications employ broken assumptions about this syscall =)

>
>> Probably this could be solved in userspace tool "mount.overlay" - it could load
>> features and layers from config file or xattr and set required mount
>> options automatically.
>
> I'm all for having a better API to mount overlayfs, but I don't think
> blocking on the "redirect=on" by default is a good way to get this.
>
> Cheers,
> --
> Raphaël Hertzog ◈ Debian Developer
>
> Support Debian LTS: http://www.freexian.com/services/debian-lts.html
> Learn to master Debian: http://debian-handbook.info/get/
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Nov. 7, 2016, 1:38 p.m. UTC | #22
On Mon, Nov 7, 2016 at 12:08 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Mon, Nov 7, 2016 at 1:04 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Mon, Nov 7, 2016 at 10:58 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>
>>> I've stumbled on somehow related problem - concurrent copy-ups are
>>> strictly serialized by rename locks.
>>> Obviously, file copying could be done in parallel: locks are required
>>> only for final rename.
>>> Because of that overlay slower that aufs for some workloads.
>>
>> Easy to fix: for each copy up create a separate subdir of "work".
>> Then the contention is only for the time of creating the subdir, which
>> is very short.
>
> Yeah, but lock_rename() also takes per-sb s_vfs_rename_mutex (kludge by Al Viro)
> I think proper synchronization for concurrent copy-up (for example
> round flag on ovl_entry) and  locking rename only for rename could be
> better.

Removing s_vfs_rename_mutex from copy-up path is something I have been
pondering about.
Assuming that I understand Al's comment above vfs_rename() correctly,
the sole purpose of per-sb serialization is to prevent loop creations.
However, how can one create a loop by moving a non-directory?
So it looks like at least for the non-dir copy up case, a much finer grained
lock is in order.

Anyway, it's on my todo list, as concurrent operation performance on overlayfs
is important to out use case.

Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Raphael Hertzog Nov. 7, 2016, 1:42 p.m. UTC | #23
On Mon, 07 Nov 2016, Konstantin Khlebnikov wrote:
> > Why? (I don't have the feeling that your subsequent paragraphs answer this
> > question... unless "overlayfs mounting is hard, let's complicate it even
> > more" is your answer)
> 
> Mixing flags from mount options, xattrs and emptiness of upper layer
> doesn't make it simpler.

It depends for whom. It does make it simpler for applications that just
want overlayfs to work like other normal filesystems.

> We have clear statement that options in /proc/mounts describes overlay
> instance, let's keep feeding state in this way.

Didn't you say above that xattrs provide flags too?

> > I find it highly disturbing to have to modify all those applications just
> > to get the correct semantics to rename a directory.
> 
> Other application are already aware about overlay layout and use it.
> We cannot enable by default new backward incompatible features.

On the opposite, if we have to modify those applications to add a new
mount option, then they will no longer work with older versions of
overlayfs... so you move the complexity down to applications if they want
to work with multiple kernel versions.

There's no technical problem to enable a new backward incompatible
feature. It's just that you don't want to do it in case the user
wants to mount it again with an older kernel. So this is just policy.

So what about a new mount option that defines a compatibility level?

0: initial feature set
1: with renamedir flag

It would default to 1 but the user can set it to "0" to keep compatibility
with older versions of overlayfs. In the future, as more backward
incompatible changes are added, you add new levels and define the values
of the various flags based on this setting.

> Returning -EXDEV is a completely correct semantics for rename,
> most applications employ broken assumptions about this syscall =)

Maybe (I don't know what standards say), but then what matters is
real-life. And in real-life, it's somewhat unexpected to get back -EXDEV
when the rename() happens in the same directory (and has therefore no
chance to cross any mount boundary).

Cheers,
Miklos Szeredi Nov. 10, 2016, 10:39 p.m. UTC | #24
New version is at:

git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git #redirect

News:
- it actually should work in all cases
- when rename is not cross directory, just store the new name instead
of a full path, as suggested by Amir
- when redirect path is too long fall back to EXDEV (the max length
should probably be a module param)

About turning the feature on or off.  Yes, maybe the empty checking is
too complicated.  Going one simpler:

 - default to old behavior, turn on with mount option
 - add module option and kernel compile option to turn on the feature by default

I guess distros wil simply enable this by default, since back
compatibility is basically a non-issue.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Nov. 10, 2016, 10:56 p.m. UTC | #25
On Mon, Nov 7, 2016 at 3:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Mon, Nov 7, 2016 at 12:08 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>> On Mon, Nov 7, 2016 at 1:04 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>> On Mon, Nov 7, 2016 at 10:58 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>>
>>>> I've stumbled on somehow related problem - concurrent copy-ups are
>>>> strictly serialized by rename locks.
>>>> Obviously, file copying could be done in parallel: locks are required
>>>> only for final rename.
>>>> Because of that overlay slower that aufs for some workloads.
>>>
>>> Easy to fix: for each copy up create a separate subdir of "work".
>>> Then the contention is only for the time of creating the subdir, which
>>> is very short.
>>
>> Yeah, but lock_rename() also takes per-sb s_vfs_rename_mutex (kludge by Al Viro)
>> I think proper synchronization for concurrent copy-up (for example
>> round flag on ovl_entry) and  locking rename only for rename could be
>> better.
>
> Removing s_vfs_rename_mutex from copy-up path is something I have been
> pondering about.
> Assuming that I understand Al's comment above vfs_rename() correctly,
> the sole purpose of per-sb serialization is to prevent loop creations.
> However, how can one create a loop by moving a non-directory?
> So it looks like at least for the non-dir copy up case, a much finer grained
> lock is in order.
>


I posted patches to relax the s_vfs_rename_mutex for copy-up and
whiteout in some use cases.

Konstantin,

It would be useful to know if those patches help with your use case.

Thanks,
Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Konstantin Khlebnikov Nov. 11, 2016, 9:41 a.m. UTC | #26
On Fri, Nov 11, 2016 at 1:39 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> New version is at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git #redirect
>
> News:
> - it actually should work in all cases
> - when rename is not cross directory, just store the new name instead
> of a full path, as suggested by Amir
> - when redirect path is too long fall back to EXDEV (the max length
> should probably be a module param)
>
> About turning the feature on or off.  Yes, maybe the empty checking is
> too complicated.  Going one simpler:
>
>  - default to old behavior, turn on with mount option
>  - add module option and kernel compile option to turn on the feature by default
>
> I guess distros wil simply enable this by default, since back
> compatibility is basically a non-issue.

Looks good.

I suppose module parameter exposed in /sys/module/overlay/parameters/ could be
documented as recommended way for detecting presence of that feature.

--
Konstantin
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Konstantin Khlebnikov Nov. 11, 2016, 9:46 a.m. UTC | #27
On Fri, Nov 11, 2016 at 1:56 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Mon, Nov 7, 2016 at 3:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Mon, Nov 7, 2016 at 12:08 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>> On Mon, Nov 7, 2016 at 1:04 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>> On Mon, Nov 7, 2016 at 10:58 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>>>
>>>>> I've stumbled on somehow related problem - concurrent copy-ups are
>>>>> strictly serialized by rename locks.
>>>>> Obviously, file copying could be done in parallel: locks are required
>>>>> only for final rename.
>>>>> Because of that overlay slower that aufs for some workloads.
>>>>
>>>> Easy to fix: for each copy up create a separate subdir of "work".
>>>> Then the contention is only for the time of creating the subdir, which
>>>> is very short.
>>>
>>> Yeah, but lock_rename() also takes per-sb s_vfs_rename_mutex (kludge by Al Viro)
>>> I think proper synchronization for concurrent copy-up (for example
>>> round flag on ovl_entry) and  locking rename only for rename could be
>>> better.
>>
>> Removing s_vfs_rename_mutex from copy-up path is something I have been
>> pondering about.
>> Assuming that I understand Al's comment above vfs_rename() correctly,
>> the sole purpose of per-sb serialization is to prevent loop creations.
>> However, how can one create a loop by moving a non-directory?
>> So it looks like at least for the non-dir copy up case, a much finer grained
>> lock is in order.
>>
>
>
> I posted patches to relax the s_vfs_rename_mutex for copy-up and
> whiteout in some use cases.
>
> Konstantin,
>
> It would be useful to know if those patches help with your use case.
>

Well.. I think relaxing only s_vfs_rename_mutex wouldn't help much here.
Copying is still serialized by i_mutex on workdir?
Data copying should be done without rename locks at all.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miklos Szeredi Nov. 11, 2016, 10:06 a.m. UTC | #28
On Fri, Nov 11, 2016 at 10:46 AM, Konstantin Khlebnikov
<koct9i@gmail.com> wrote:
> On Fri, Nov 11, 2016 at 1:56 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Mon, Nov 7, 2016 at 3:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>> On Mon, Nov 7, 2016 at 12:08 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>>> On Mon, Nov 7, 2016 at 1:04 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>>> On Mon, Nov 7, 2016 at 10:58 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>>>>
>>>>>> I've stumbled on somehow related problem - concurrent copy-ups are
>>>>>> strictly serialized by rename locks.
>>>>>> Obviously, file copying could be done in parallel: locks are required
>>>>>> only for final rename.
>>>>>> Because of that overlay slower that aufs for some workloads.
>>>>>
>>>>> Easy to fix: for each copy up create a separate subdir of "work".
>>>>> Then the contention is only for the time of creating the subdir, which
>>>>> is very short.
>>>>
>>>> Yeah, but lock_rename() also takes per-sb s_vfs_rename_mutex (kludge by Al Viro)
>>>> I think proper synchronization for concurrent copy-up (for example
>>>> round flag on ovl_entry) and  locking rename only for rename could be
>>>> better.
>>>
>>> Removing s_vfs_rename_mutex from copy-up path is something I have been
>>> pondering about.
>>> Assuming that I understand Al's comment above vfs_rename() correctly,
>>> the sole purpose of per-sb serialization is to prevent loop creations.
>>> However, how can one create a loop by moving a non-directory?
>>> So it looks like at least for the non-dir copy up case, a much finer grained
>>> lock is in order.
>>>
>>
>>
>> I posted patches to relax the s_vfs_rename_mutex for copy-up and
>> whiteout in some use cases.
>>
>> Konstantin,
>>
>> It would be useful to know if those patches help with your use case.
>>
>
> Well.. I think relaxing only s_vfs_rename_mutex wouldn't help much here.
> Copying is still serialized by i_mutex on workdir?
> Data copying should be done without rename locks at all.

We do need something to prevent multiple copy-ups starting up in
parallel on the same file, though.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Nov. 11, 2016, 12:42 p.m. UTC | #29
On Fri, Nov 11, 2016 at 12:06 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Fri, Nov 11, 2016 at 10:46 AM, Konstantin Khlebnikov
> <koct9i@gmail.com> wrote:
>> On Fri, Nov 11, 2016 at 1:56 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>>> On Mon, Nov 7, 2016 at 3:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>>> On Mon, Nov 7, 2016 at 12:08 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>>>> On Mon, Nov 7, 2016 at 1:04 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>>>> On Mon, Nov 7, 2016 at 10:58 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>>>>>
>>>>>>> I've stumbled on somehow related problem - concurrent copy-ups are
>>>>>>> strictly serialized by rename locks.
>>>>>>> Obviously, file copying could be done in parallel: locks are required
>>>>>>> only for final rename.
>>>>>>> Because of that overlay slower that aufs for some workloads.
>>>>>>
>>>>>> Easy to fix: for each copy up create a separate subdir of "work".
>>>>>> Then the contention is only for the time of creating the subdir, which
>>>>>> is very short.
>>>>>
>>>>> Yeah, but lock_rename() also takes per-sb s_vfs_rename_mutex (kludge by Al Viro)
>>>>> I think proper synchronization for concurrent copy-up (for example
>>>>> round flag on ovl_entry) and  locking rename only for rename could be
>>>>> better.
>>>>
>>>> Removing s_vfs_rename_mutex from copy-up path is something I have been
>>>> pondering about.
>>>> Assuming that I understand Al's comment above vfs_rename() correctly,
>>>> the sole purpose of per-sb serialization is to prevent loop creations.
>>>> However, how can one create a loop by moving a non-directory?
>>>> So it looks like at least for the non-dir copy up case, a much finer grained
>>>> lock is in order.
>>>>
>>>
>>>
>>> I posted patches to relax the s_vfs_rename_mutex for copy-up and
>>> whiteout in some use cases.
>>>
>>> Konstantin,
>>>
>>> It would be useful to know if those patches help with your use case.
>>>
>>
>> Well.. I think relaxing only s_vfs_rename_mutex wouldn't help much here.
>> Copying is still serialized by i_mutex on workdir?
>> Data copying should be done without rename locks at all.
>
> We do need something to prevent multiple copy-ups starting up in
> parallel on the same file, though.
>

I guess an inode_lock on the copy-up victim should suffice?
I will look into it as soon as I am done with profiling.
So far I ran only 2 rm -rf threads on 2 different overlay mounts
on the same underlying fs
and s_vfs_rename_mutex was contended about ~4% of the time.

In this test, copy-up is not dominant - only ~2% for the directory
copy-ups, but vfs_whiteouts take 20% and the vfs_rename itself 10%,
both with s_vfs_rename_mutex held.

Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Nov. 13, 2016, 9:11 a.m. UTC | #30
On Fri, Nov 11, 2016 at 2:42 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Fri, Nov 11, 2016 at 12:06 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Fri, Nov 11, 2016 at 10:46 AM, Konstantin Khlebnikov
>> <koct9i@gmail.com> wrote:
>>> On Fri, Nov 11, 2016 at 1:56 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>>>> On Mon, Nov 7, 2016 at 3:38 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>>>> On Mon, Nov 7, 2016 at 12:08 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>>>>> On Mon, Nov 7, 2016 at 1:04 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>>>>> On Mon, Nov 7, 2016 at 10:58 AM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>>>>>>
>>>>>>>> I've stumbled on somehow related problem - concurrent copy-ups are
>>>>>>>> strictly serialized by rename locks.
>>>>>>>> Obviously, file copying could be done in parallel: locks are required
>>>>>>>> only for final rename.
>>>>>>>> Because of that overlay slower that aufs for some workloads.
>>>>>>>
>>>>>>> Easy to fix: for each copy up create a separate subdir of "work".
>>>>>>> Then the contention is only for the time of creating the subdir, which
>>>>>>> is very short.
>>>>>>
>>>>>> Yeah, but lock_rename() also takes per-sb s_vfs_rename_mutex (kludge by Al Viro)
>>>>>> I think proper synchronization for concurrent copy-up (for example
>>>>>> round flag on ovl_entry) and  locking rename only for rename could be
>>>>>> better.
>>>>>
>>>>> Removing s_vfs_rename_mutex from copy-up path is something I have been
>>>>> pondering about.
>>>>> Assuming that I understand Al's comment above vfs_rename() correctly,
>>>>> the sole purpose of per-sb serialization is to prevent loop creations.
>>>>> However, how can one create a loop by moving a non-directory?
>>>>> So it looks like at least for the non-dir copy up case, a much finer grained
>>>>> lock is in order.
>>>>>
>>>>
>>>>
>>>> I posted patches to relax the s_vfs_rename_mutex for copy-up and
>>>> whiteout in some use cases.
>>>>
>>>> Konstantin,
>>>>
>>>> It would be useful to know if those patches help with your use case.
>>>>
>>>
>>> Well.. I think relaxing only s_vfs_rename_mutex wouldn't help much here.
>>> Copying is still serialized by i_mutex on workdir?
>>> Data copying should be done without rename locks at all.
>>
>> We do need something to prevent multiple copy-ups starting up in
>> parallel on the same file, though.
>>
>
> I guess an inode_lock on the copy-up victim should suffice?

Just to follow up on this hijacked thread.
I posted patches to lock the overlay inode of copied up file
and relaxed the lock_rename during data copy up.

Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Nov. 13, 2016, 10 a.m. UTC | #31
On Fri, Nov 11, 2016 at 12:39 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> New version is at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git #redirect
>
> News:
> - it actually should work in all cases
> - when rename is not cross directory, just store the new name instead
> of a full path, as suggested by Amir
> - when redirect path is too long fall back to EXDEV (the max length
> should probably be a module param)
>

Looks goods, except for the case of change from relative to absolute
redirect of the victim dentry. IIUC, ovl_set_redirect() will return immediately
because ovl_dentry_is_redirect() and will not get to setting the absolute
redirect.

It passed my sanity tests (including recycle test) and on top of my copy up
lock changes.

You can add Reviewed-by/Tested-by me to
  1380846 ovl: redirect on rename-dir

> About turning the feature on or off.  Yes, maybe the empty checking is
> too complicated.  Going one simpler:
>
>  - default to old behavior, turn on with mount option
>  - add module option and kernel compile option to turn on the feature by default
>

I don't see these changes on your branch.
Are these your plans for a future version?

Thanks,
Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Nov. 14, 2016, 4:25 p.m. UTC | #32
On Sun, Nov 13, 2016 at 12:00 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Fri, Nov 11, 2016 at 12:39 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> New version is at:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git #redirect
>>
>> News:
>> - it actually should work in all cases
>> - when rename is not cross directory, just store the new name instead
>> of a full path, as suggested by Amir
>> - when redirect path is too long fall back to EXDEV (the max length
>> should probably be a module param)
>>
>
> Looks goods, except for the case of change from relative to absolute
> redirect of the victim dentry. IIUC, ovl_set_redirect() will return immediately
> because ovl_dentry_is_redirect() and will not get to setting the absolute
> redirect.
>

I added some more tests to catch this problem at:
https://github.com/amir73il/unionmount-testsuite.git #ovl_rename_dir

In the new version, upper recycling is optional and you can set a bound
to the number of layers. This is needed to catch the bug because the
scenario is:
- Rename populated dir in same dir
- Move the populated dir to another dir
- Re-mount
- Populated dir is empty

The following test run demonstrates it:

$ sudo ./run --ov=0 rename-move-dir
...
TEST rename-move-dir.py:37: Rename populated dir and move into another
 ./run --rename /mnt/a/dir102 /mnt/a/dir102x
 ./run --rename /mnt/a/dir102 /mnt/a/dir102x -E ENOENT
 ./run --rename /mnt/a/dir102x /mnt/a/empty102/dir102
 ./run --rename /mnt/a/dir102x /mnt/a/empty102/dir102 -E ENOENT
 ./run --open-file /mnt/a/dir102 -r -d -E ENOENT
 ./run --open-file /mnt/a/dir102x -r -d -E ENOENT
 ./run --open-file /mnt/a/empty102/dir102 -r -d
/mnt/a/empty102/dir102/a: File is missing


Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miklos Szeredi Nov. 16, 2016, 10 p.m. UTC | #33
On Mon, Nov 14, 2016 at 5:25 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Sun, Nov 13, 2016 at 12:00 PM, Amir Goldstein <amir73il@gmail.com> wrote:

>> Looks goods, except for the case of change from relative to absolute
>> redirect of the victim dentry. IIUC, ovl_set_redirect() will return immediately
>> because ovl_dentry_is_redirect() and will not get to setting the absolute
>> redirect.
>>
>
> I added some more tests to catch this problem at:
> https://github.com/amir73il/unionmount-testsuite.git #ovl_rename_dir

Thanks for testing.

Force pushed updated version to the usual place:

   git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git #redirect

This also has the xattr feature thing replaced with mount option,
module param and kernel config option.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt
index 5108425157ac..fae48ab3b36b 100644
--- a/Documentation/filesystems/overlayfs.txt
+++ b/Documentation/filesystems/overlayfs.txt
@@ -130,6 +130,23 @@  directory.
 Readdir on directories that are not merged is simply handled by the
 underlying directory (upper or lower).
 
+renaming directories
+--------------------
+
+When renaming a directory that is on the lower layer or merged (i.e. the
+directory was not created on the upper layer to start with) overlayfs can
+handle it in two different ways:
+
+1) return EXDEV error: this error is returned by rename(2) when trying to
+   move a file or directory across filesystem boundaries.  Hence
+   applications are usually prepared to hande this error (mv(1) for example
+   recursively copies the directory tree).  This is the default behavior.
+
+2) If the "redirect_dir" feature is enabled, then the directory will be
+   copied up (but not the contents).  Then the "trusted.overlay.redirect"
+   extended attribute is set to the path of the original location from the
+   root of the overlay.  Finally the directory is moved to the new
+   location.
 
 Non-directories
 ---------------
@@ -201,8 +218,8 @@  If a file with multiple hard links is copied up, then this will
 "break" the link.  Changes will not be propagated to other names
 referring to the same inode.
 
-Directory trees are not copied up.  If rename(2) is performed on a directory
-which is on the lower layer or is merged, then -EXDEV will be returned.
+Unless "redirect_dir" feature is enabled, rename(2) on a lower or merged
+directory will fail with EXDEV.
 
 Changes to underlying filesystems
 ---------------------------------
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 49f6158bb04c..0d7075208099 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -323,17 +323,11 @@  static int ovl_copy_up_locked(struct dentry *workdir, struct dentry *upperdir,
 /*
  * Copy up a single dentry
  *
- * Directory renames only allowed on "pure upper" (already created on
- * upper filesystem, never copied up).  Directories which are on lower or
- * are merged may not be renamed.  For these -EXDEV is returned and
- * userspace has to deal with it.  This means, when copying up a
- * directory we can rely on it and ancestors being stable.
- *
- * Non-directory renames start with copy up of source if necessary.  The
- * actual rename will only proceed once the copy up was successful.  Copy
- * up uses upper parent i_mutex for exclusion.  Since rename can change
- * d_parent it is possible that the copy up will lock the old parent.  At
- * that point the file will have already been copied up anyway.
+ * All renames start with copy up of source if necessary.  The actual
+ * rename will only proceed once the copy up was successful.  Copy up uses
+ * upper parent i_mutex for exclusion.  Since rename can change d_parent it
+ * is possible that the copy up will lock the old parent.  At that point
+ * the file will have already been copied up anyway.
  */
 int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
 		    struct path *lowerpath, struct kstat *stat)
@@ -345,7 +339,6 @@  int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
 	struct path parentpath;
 	struct dentry *lowerdentry = lowerpath->dentry;
 	struct dentry *upperdir;
-	struct dentry *upperdentry;
 	const char *link = NULL;
 
 	if (WARN_ON(!workdir))
@@ -371,8 +364,7 @@  int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
 		pr_err("overlayfs: failed to lock workdir+upperdir\n");
 		goto out_unlock;
 	}
-	upperdentry = ovl_dentry_upper(dentry);
-	if (upperdentry) {
+	if (ovl_dentry_upper(dentry)) {
 		/* Raced with another copy-up?  Nothing to do, then... */
 		err = 0;
 		goto out_unlock;
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 2c1057d747cb..065e0211f9b6 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -757,6 +757,40 @@  static bool ovl_type_merge_or_lower(struct dentry *dentry)
 	return OVL_TYPE_MERGE(type) || !OVL_TYPE_UPPER(type);
 }
 
+static bool ovl_can_move(struct dentry *dentry)
+{
+	return ovl_redirect_dir(dentry->d_sb) ||
+		!d_is_dir(dentry) || !ovl_type_merge_or_lower(dentry);
+}
+
+static int ovl_set_redirect(struct dentry *dentry)
+{
+	char *buf;
+	char *path;
+	int err;
+
+	if (ovl_dentry_is_redirect(dentry))
+		return 0;
+
+	buf = __getname();
+	if (!buf)
+		return -ENOMEM;
+
+	path = dentry_path_raw(dentry, buf, PATH_MAX);
+	err = PTR_ERR(path);
+	if (IS_ERR(path))
+		goto putname;
+
+	err = ovl_do_setxattr(ovl_dentry_upper(dentry), OVL_XATTR_REDIRECT,
+			      path, strlen(path), 0);
+putname:
+	__putname(buf);
+	if (!err)
+		ovl_dentry_set_redirect(dentry);
+
+	return err;
+}
+
 static int ovl_rename(struct inode *olddir, struct dentry *old,
 		      struct inode *newdir, struct dentry *new,
 		      unsigned int flags)
@@ -784,9 +818,9 @@  static int ovl_rename(struct inode *olddir, struct dentry *old,
 
 	/* Don't copy up directory trees */
 	err = -EXDEV;
-	if (is_dir && ovl_type_merge_or_lower(old))
+	if (!ovl_can_move(old))
 		goto out;
-	if (!overwrite && new_is_dir && ovl_type_merge_or_lower(new))
+	if (!overwrite && !ovl_can_move(new))
 		goto out;
 
 	err = ovl_want_write(old);
@@ -837,7 +871,6 @@  static int ovl_rename(struct inode *olddir, struct dentry *old,
 
 	trap = lock_rename(new_upperdir, old_upperdir);
 
-
 	olddentry = lookup_one_len(old->d_name.name, old_upperdir,
 				   old->d_name.len);
 	err = PTR_ERR(olddentry);
@@ -880,31 +913,34 @@  static int ovl_rename(struct inode *olddir, struct dentry *old,
 	if (WARN_ON(olddentry->d_inode == newdentry->d_inode))
 		goto out_dput;
 
-	if (is_dir && !old_opaque && ovl_lower_positive(new)) {
-		err = ovl_set_opaque(olddentry);
-		if (err)
-			goto out_dput;
-		ovl_dentry_set_opaque(old, true);
+	if (is_dir) {
+		if (ovl_type_merge_or_lower(old)) {
+			err = ovl_set_redirect(old);
+			if (err)
+				goto out_dput;
+		} else if (!old_opaque && ovl_lower_positive(new)) {
+			err = ovl_set_opaque(olddentry);
+			if (err)
+				goto out_dput;
+			ovl_dentry_set_opaque(old, true);
+		}
 	}
-	if (!overwrite &&
-	    new_is_dir && !new_opaque && ovl_lower_positive(old)) {
-		err = ovl_set_opaque(newdentry);
-		if (err)
-			goto out_dput;
-		ovl_dentry_set_opaque(new, true);
+	if (!overwrite && new_is_dir) {
+		if (ovl_type_merge_or_lower(new)) {
+			err = ovl_set_redirect(new);
+			if (err)
+				goto out_dput;
+		} else if (!new_opaque && ovl_lower_positive(old)) {
+			err = ovl_set_opaque(newdentry);
+			if (err)
+				goto out_dput;
+			ovl_dentry_set_opaque(new, true);
+		}
 	}
 
-	if (old_opaque || new_opaque) {
-		err = ovl_do_rename(old_upperdir->d_inode, olddentry,
-				    new_upperdir->d_inode, newdentry,
-				    flags);
-	} else {
-		/* No debug for the plain case */
-		BUG_ON(flags & ~RENAME_EXCHANGE);
-		err = vfs_rename(old_upperdir->d_inode, olddentry,
-				 new_upperdir->d_inode, newdentry,
-				 NULL, flags);
-	}
+	err = ovl_do_rename(old_upperdir->d_inode, olddentry,
+			    new_upperdir->d_inode, newdentry,
+			    flags);
 	if (err)
 		goto out_dput;
 
diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index f4057fcb0246..c7cacbb8ce09 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -8,6 +8,7 @@ 
  */
 
 #include <linux/fs.h>
+#include <linux/mount.h>
 #include <linux/namei.h>
 #include <linux/xattr.h>
 #include "overlayfs.h"
@@ -48,6 +49,28 @@  static bool ovl_is_opaquedir(struct dentry *dentry)
 	return false;
 }
 
+static int ovl_check_redirect(struct dentry *dentry, char **bufp)
+{
+	int res;
+
+	res = vfs_getxattr(dentry, OVL_XATTR_REDIRECT, NULL, 0);
+	if (res > 0) {
+		char *buf = kzalloc(res + 1, GFP_KERNEL);
+
+		if (!buf)
+			return -ENOMEM;
+
+		res = vfs_getxattr(dentry, OVL_XATTR_REDIRECT, buf, res);
+		if (res < 0)
+			return res;
+
+		kfree(*bufp);
+		*bufp = buf;
+	}
+
+	return 0;
+}
+
 /*
  * Returns next layer in stack starting from top.
  * Returns -1 if this is the last layer.
@@ -80,11 +103,13 @@  struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 	unsigned int ctr = 0;
 	struct inode *inode = NULL;
 	bool upperopaque = false;
+	bool upperredirect = false;
 	bool stop = false;
 	bool isdir = false;
 	struct dentry *this;
 	unsigned int i;
 	int err;
+	char *redirect = NULL;
 
 	old_cred = ovl_override_creds(dentry->d_sb);
 	upperdir = ovl_upperdentry_dereference(poe);
@@ -110,11 +135,23 @@  struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 				isdir = true;
 				if (poe->numlower && ovl_is_opaquedir(this))
 					stop = upperopaque = true;
+				else if (ovl_redirect_dir(dentry->d_sb)) {
+					err = ovl_check_redirect(this,
+								 &redirect);
+					if (err)
+						goto out;
+
+					if (redirect)
+						upperredirect = true;
+				}
 			}
 		}
 		upperdentry = this;
 	}
 
+	if (redirect)
+		poe = dentry->d_sb->s_root->d_fsdata;
+
 	if (!stop && poe->numlower) {
 		err = -ENOMEM;
 		stack = kcalloc(poe->numlower, sizeof(struct path), GFP_KERNEL);
@@ -125,15 +162,39 @@  struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 	for (i = 0; !stop && i < poe->numlower; i++) {
 		struct path lowerpath = poe->lowerstack[i];
 
-		this = ovl_lookup_real(lowerpath.dentry, &dentry->d_name);
-		err = PTR_ERR(this);
-		if (IS_ERR(this)) {
-			/*
-			 * If it's positive, then treat ENAMETOOLONG as ENOENT.
-			 */
-			if (err == -ENAMETOOLONG && (upperdentry || ctr))
-				continue;
-			goto out_put;
+		if (!redirect) {
+			this = ovl_lookup_real(lowerpath.dentry, &dentry->d_name);
+			err = PTR_ERR(this);
+			if (IS_ERR(this)) {
+				/*
+				 * If it's positive, then treat ENAMETOOLONG as ENOENT.
+				 */
+				if (err == -ENAMETOOLONG && (upperdentry || ctr))
+					continue;
+				goto out_put;
+			}
+		} else {
+			struct path thispath;
+
+			err = vfs_path_lookup(lowerpath.dentry, lowerpath.mnt,
+					      redirect, 0, &thispath);
+
+			if (err) {
+				if (err == -ENOENT || err == -ENAMETOOLONG)
+					this = NULL;
+			} else {
+				this = thispath.dentry;
+				mntput(thispath.mnt);
+				if (!this->d_inode) {
+					dput(this);
+					this = NULL;
+				} else if (ovl_dentry_weird(this)) {
+					dput(this);
+					err = -EREMOTE;
+				}
+			}
+			if (err)
+				goto out_put;
 		}
 		if (!this)
 			continue;
@@ -162,6 +223,23 @@  struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 		stack[ctr].dentry = this;
 		stack[ctr].mnt = lowerpath.mnt;
 		ctr++;
+
+		if (!stop && i != poe->numlower - 1 &&
+		    d_is_dir(this) && ovl_redirect_dir(dentry->d_sb)) {
+			err = ovl_check_redirect(this, &redirect);
+			if (err)
+				goto out_put;
+
+			if (redirect && poe != dentry->d_sb->s_root->d_fsdata) {
+				poe = dentry->d_sb->s_root->d_fsdata;
+
+				for (i = 0; i < poe->numlower; i++)
+					if (poe->lowerstack[i].mnt == lowerpath.mnt)
+						break;
+				if (WARN_ON(i == poe->numlower))
+					break;
+			}
+		}
 	}
 
 	oe = ovl_alloc_entry(ctr);
@@ -192,9 +270,11 @@  struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 
 	revert_creds(old_cred);
 	oe->opaque = upperopaque;
+	oe->redirect = upperredirect;
 	oe->__upperdentry = upperdentry;
 	memcpy(oe->lowerstack, stack, sizeof(struct path) * ctr);
 	kfree(stack);
+	kfree(redirect);
 	dentry->d_fsdata = oe;
 	d_add(dentry, inode);
 
@@ -209,6 +289,7 @@  struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 out_put_upper:
 	dput(upperdentry);
 out:
+	kfree(redirect);
 	revert_creds(old_cred);
 	return ERR_PTR(err);
 }
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index d61d5b9d0d91..9d80ce367ad8 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -20,6 +20,7 @@  enum ovl_path_type {
 #define OVL_XATTR_PREFIX XATTR_TRUSTED_PREFIX "overlay."
 #define OVL_XATTR_OPAQUE OVL_XATTR_PREFIX "opaque"
 #define OVL_XATTR_FEATURES OVL_XATTR_PREFIX "features"
+#define OVL_XATTR_REDIRECT OVL_XATTR_PREFIX "redirect"
 
 #define OVL_ISUPPER_MASK 1UL
 
@@ -157,6 +158,9 @@  void ovl_set_dir_cache(struct dentry *dentry, struct ovl_dir_cache *cache);
 bool ovl_dentry_is_opaque(struct dentry *dentry);
 bool ovl_dentry_is_whiteout(struct dentry *dentry);
 void ovl_dentry_set_opaque(struct dentry *dentry, bool opaque);
+bool ovl_redirect_dir(struct super_block *sb);
+bool ovl_dentry_is_redirect(struct dentry *dentry);
+void ovl_dentry_set_redirect(struct dentry *dentry);
 void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry);
 void ovl_inode_init(struct inode *inode, struct inode *realinode,
 		    bool is_upper);
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 3b7ba59ad27e..2b22645535ff 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -26,6 +26,9 @@  struct ovl_fs {
 	struct ovl_config config;
 	/* creds of process who forced instantiation of super block */
 	const struct cred *creator_cred;
+
+	/* Check for "redirect" on directories */
+	bool redirect_dir;
 };
 
 /* private information held for every overlayfs dentry */
@@ -36,6 +39,7 @@  struct ovl_entry {
 		struct {
 			u64 version;
 			bool opaque;
+			bool redirect;
 		};
 		struct rcu_head rcu;
 	};
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index d6dc8d905d00..fc22a8a2e0d0 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -397,7 +397,7 @@  static struct dentry *ovl_workdir_create(struct vfsmount *mnt,
 	goto out_unlock;
 }
 
-static int ovl_check_features(struct dentry *root)
+static int ovl_check_features(struct ovl_fs *ufs, struct dentry *root)
 {
 	int res;
 	char *buf, *tmp, *p;
@@ -421,8 +421,12 @@  static int ovl_check_features(struct dentry *root)
 	res = 0;
 	tmp = buf;
 	while ((p = strsep(&tmp, ",")) != NULL) {
-		res = -EINVAL;
-		pr_err("overlayfs: feature '%s' not supported\n", p);
+		if (strcmp(p, "redirect_dir") == 0) {
+			ufs->redirect_dir = true;
+		} else {
+			res = -EINVAL;
+			pr_err("overlayfs: feature '%s' not supported\n", p);
+		}
 	}
 out_free:
 	kfree(buf);
@@ -494,8 +498,8 @@  static int ovl_mount_dir(const char *name, struct path *path)
 	return err;
 }
 
-static int ovl_lower_dir(const char *name, struct path *path, long *namelen,
-			 int *stack_depth, bool *remote)
+static int ovl_lower_dir(const char *name, struct path *path,
+			 struct ovl_fs *ufs, int *stack_depth, bool *remote)
 {
 	int err;
 	struct kstatfs statfs;
@@ -504,7 +508,7 @@  static int ovl_lower_dir(const char *name, struct path *path, long *namelen,
 	if (err)
 		goto out;
 
-	err = ovl_check_features(path->dentry);
+	err = ovl_check_features(ufs, path->dentry);
 	if (err)
 		goto out_put;
 
@@ -513,7 +517,7 @@  static int ovl_lower_dir(const char *name, struct path *path, long *namelen,
 		pr_err("overlayfs: statfs failed on '%s'\n", name);
 		goto out_put;
 	}
-	*namelen = max(*namelen, statfs.f_namelen);
+	ufs->lower_namelen = max(ufs->lower_namelen, statfs.f_namelen);
 	*stack_depth = max(*stack_depth, path->mnt->mnt_sb->s_stack_depth);
 
 	if (ovl_dentry_remote(path->dentry))
@@ -730,7 +734,7 @@  static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 			goto out_put_upperpath;
 		}
 
-		err = ovl_check_features(upperpath.dentry);
+		err = ovl_check_features(ufs, upperpath.dentry);
 		if (err)
 			goto out_put_upperpath;
 
@@ -771,9 +775,8 @@  static int ovl_fill_super(struct super_block *sb, void *data, int silent)
 
 	lower = lowertmp;
 	for (numlower = 0; numlower < stacklen; numlower++) {
-		err = ovl_lower_dir(lower, &stack[numlower],
-				    &ufs->lower_namelen, &sb->s_stack_depth,
-				    &remote);
+		err = ovl_lower_dir(lower, &stack[numlower], ufs,
+				    &sb->s_stack_depth, &remote);
 		if (err)
 			goto out_put_lowerpath;
 
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 0d45a84468d2..06dae4cabd53 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -176,6 +176,25 @@  void ovl_dentry_set_opaque(struct dentry *dentry, bool opaque)
 	oe->opaque = opaque;
 }
 
+bool ovl_redirect_dir(struct super_block *sb)
+{
+	struct ovl_fs *ofs = sb->s_fs_info;
+
+	return ofs->redirect_dir;
+}
+
+bool ovl_dentry_is_redirect(struct dentry *dentry)
+{
+	struct ovl_entry *oe = dentry->d_fsdata;
+	return oe->redirect;
+}
+
+void ovl_dentry_set_redirect(struct dentry *dentry)
+{
+	struct ovl_entry *oe = dentry->d_fsdata;
+	oe->redirect = true;
+}
+
 void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry)
 {
 	struct ovl_entry *oe = dentry->d_fsdata;