Message ID | 20161021201335.GB20129@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Oct 21, 2016 at 11:13 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Fri, Oct 21, 2016 at 11:53:41AM +0300, Amir Goldstein wrote: >> On Thu, Oct 20, 2016 at 11:54 PM, Vivek Goyal <vgoyal@redhat.com> wrote: >> > On Thu, Oct 20, 2016 at 04:46:30PM -0400, Vivek Goyal wrote: >> > >> > [..] >> >> > +static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *to) >> >> > +{ >> >> > + struct file *file = iocb->ki_filp; >> >> > + bool isupper = OVL_TYPE_UPPER(ovl_path_type(file->f_path.dentry)); >> >> > + ssize_t ret = -EINVAL; >> >> > + >> >> > + if (likely(!isupper)) { >> >> > + const struct file_operations *fop = ovl_real_fop(file); >> >> > + >> >> > + if (likely(fop->read_iter)) >> >> > + ret = fop->read_iter(iocb, to); >> >> > + } else { >> >> > + struct file *upperfile = filp_clone_open(file); >> >> > + >> >> >> >> IIUC, every read of lower file will call filp_clone_open(). Looking at the >> >> code of filp_clone_open(), I am concerned about the overhead of this call. >> >> Is it significant? Don't want to be paying too much of penalty for read >> >> operation on lower files. That would be a common case for containers. >> >> >> > >> > Looks like I read the code in reverse. So if I open a file read-only, >> > and if it has not been copied up, I will simply call read_iter() on >> > lower filesystem. But if file has been copied up, then I will call >> > filp_clone_open() and pay the cost. And this will continue till this >> > file is closed by caller. >> > >> >> I wonder if that cost could be reduced by calling replace_fd() or >> some variant of it to install the cloned file onto the rofd after the >> first access?? > > Hmm.., Interesting. Will something like following work? This applies on > top of Miklos's patch. It seems to work for me. It might be completely > broken/racy though. Somebody who understands this code well, will have > to have a look. > The idea sounded scary already when I suggested it :) See below what I think is scary about this implementation... Thanks for following through. > --- > fs/file.c | 41 +++++++++++++++++++++++++++++++++++++++++ > fs/overlayfs/inode.c | 1 + > 2 files changed, 42 insertions(+) > > Index: rhvgoyal-linux/fs/overlayfs/inode.c > =================================================================== > --- rhvgoyal-linux.orig/fs/overlayfs/inode.c 2016-10-21 15:43:05.391488406 -0400 > +++ rhvgoyal-linux/fs/overlayfs/inode.c 2016-10-21 16:07:57.409420795 -0400 > @@ -416,6 +416,7 @@ static ssize_t ovl_read_iter(struct kioc > if (IS_ERR(upperfile)) { > ret = PTR_ERR(upperfile); > } else { > + replace_file(file, upperfile); When fdtable is not shared (single threaded process), after this call I think that file pointer may be free (?), because file is not reference counted. Although I did not see any code in VFS callers trying to dereference the file pointer after calling read_iter(), this seems like a dangerous practice, so will need to a way to fix that. > ret = vfs_iter_read(upperfile, to, &iocb->ki_pos); > fput(upperfile); > } > Index: rhvgoyal-linux/fs/file.c > =================================================================== > --- rhvgoyal-linux.orig/fs/file.c 2016-10-21 15:43:05.391488406 -0400 > +++ rhvgoyal-linux/fs/file.c 2016-10-21 16:08:18.168420795 -0400 > @@ -864,6 +864,47 @@ Ebusy: > return -EBUSY; > } > > + > +int replace_file(struct file *old_file, struct file *new_file) > +{ > +#define MAX_TO_FREE 8 > + int n, idx = 0; > + struct files_struct *files = current->files; > + struct fdtable *fdt; > + struct file *to_free[MAX_TO_FREE]; > + bool retry = false; > + > +try_again: > + spin_lock(&files->file_lock); > + for (n = 0, fdt = files_fdtable(files); n < fdt->max_fds; n++) { > + struct file *file; > + file = rcu_dereference_check_fdtable(files, fdt->fd[n]); > + if (!file) > + continue; > + if (file == old_file) { > + get_file(new_file); > + rcu_assign_pointer(fdt->fd[n], new_file); > + to_free[idx++] = file; > + if (idx >= MAX_TO_FREE) { > + retry = true; > + break; > + } > + } > + } > + spin_unlock(&files->file_lock); > + while (idx) { > + filp_close(to_free[--idx], files); > + } > + > + if (retry) { > + retry = false; > + idx = 0; > + goto try_again; > + } > + return 0; > +} > +EXPORT_SYMBOL(replace_file); > + > int replace_fd(unsigned fd, struct file *file, unsigned flags) > { > int err; -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Oct 22, 2016 at 10:24 AM, Amir Goldstein <amir73il@gmail.com> wrote: > On Fri, Oct 21, 2016 at 11:13 PM, Vivek Goyal <vgoyal@redhat.com> wrote: >> On Fri, Oct 21, 2016 at 11:53:41AM +0300, Amir Goldstein wrote: >>> On Thu, Oct 20, 2016 at 11:54 PM, Vivek Goyal <vgoyal@redhat.com> wrote: >>> > On Thu, Oct 20, 2016 at 04:46:30PM -0400, Vivek Goyal wrote: >>> > >>> > [..] >>> >> > +static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *to) >>> >> > +{ >>> >> > + struct file *file = iocb->ki_filp; >>> >> > + bool isupper = OVL_TYPE_UPPER(ovl_path_type(file->f_path.dentry)); >>> >> > + ssize_t ret = -EINVAL; >>> >> > + >>> >> > + if (likely(!isupper)) { >>> >> > + const struct file_operations *fop = ovl_real_fop(file); >>> >> > + >>> >> > + if (likely(fop->read_iter)) >>> >> > + ret = fop->read_iter(iocb, to); >>> >> > + } else { >>> >> > + struct file *upperfile = filp_clone_open(file); >>> >> > + >>> >> >>> >> IIUC, every read of lower file will call filp_clone_open(). Looking at the >>> >> code of filp_clone_open(), I am concerned about the overhead of this call. >>> >> Is it significant? Don't want to be paying too much of penalty for read >>> >> operation on lower files. That would be a common case for containers. >>> >> >>> > >>> > Looks like I read the code in reverse. So if I open a file read-only, >>> > and if it has not been copied up, I will simply call read_iter() on >>> > lower filesystem. But if file has been copied up, then I will call >>> > filp_clone_open() and pay the cost. And this will continue till this >>> > file is closed by caller. >>> > >>> >>> I wonder if that cost could be reduced by calling replace_fd() or >>> some variant of it to install the cloned file onto the rofd after the >>> first access?? >> >> Hmm.., Interesting. Will something like following work? This applies on >> top of Miklos's patch. It seems to work for me. It might be completely >> broken/racy though. Somebody who understands this code well, will have >> to have a look. >> > > The idea sounded scary already when I suggested it :) > See below what I think is scary about this implementation... > > Thanks for following through. > > >> --- >> fs/file.c | 41 +++++++++++++++++++++++++++++++++++++++++ >> fs/overlayfs/inode.c | 1 + >> 2 files changed, 42 insertions(+) >> >> Index: rhvgoyal-linux/fs/overlayfs/inode.c >> =================================================================== >> --- rhvgoyal-linux.orig/fs/overlayfs/inode.c 2016-10-21 15:43:05.391488406 -0400 >> +++ rhvgoyal-linux/fs/overlayfs/inode.c 2016-10-21 16:07:57.409420795 -0400 >> @@ -416,6 +416,7 @@ static ssize_t ovl_read_iter(struct kioc >> if (IS_ERR(upperfile)) { >> ret = PTR_ERR(upperfile); >> } else { >> + replace_file(file, upperfile); > > When fdtable is not shared (single threaded process), after this call > I think that file pointer > may be free (?), because file is not reference counted. > Although I did not see any code in VFS callers trying to dereference > the file pointer after > calling read_iter(), this seems like a dangerous practice, so will > need to a way to fix that. > My bad. file pointer is freed in work_task_run(), so replace_file() should be just as safe as replace_fd() and do_dup2(). -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Oct 22, 2016 at 5:39 PM, Amir Goldstein <amir73il@gmail.com> wrote: > On Sat, Oct 22, 2016 at 10:24 AM, Amir Goldstein <amir73il@gmail.com> wrote: >> On Fri, Oct 21, 2016 at 11:13 PM, Vivek Goyal <vgoyal@redhat.com> wrote: >>> --- >>> fs/file.c | 41 +++++++++++++++++++++++++++++++++++++++++ >>> fs/overlayfs/inode.c | 1 + >>> 2 files changed, 42 insertions(+) >>> >>> Index: rhvgoyal-linux/fs/overlayfs/inode.c >>> =================================================================== >>> --- rhvgoyal-linux.orig/fs/overlayfs/inode.c 2016-10-21 15:43:05.391488406 -0400 >>> +++ rhvgoyal-linux/fs/overlayfs/inode.c 2016-10-21 16:07:57.409420795 -0400 >>> @@ -416,6 +416,7 @@ static ssize_t ovl_read_iter(struct kioc >>> if (IS_ERR(upperfile)) { >>> ret = PTR_ERR(upperfile); >>> } else { >>> + replace_file(file, upperfile); I think it's a cool idea. But I'm not even going to look at the implementation for now, because it's such a rare corner case, that trying to optimize it should really be the last thing we do after everything else is working fine (and only if it actually turns out to be a thing that somebody actually cares about). Thanks, Miklos -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Index: rhvgoyal-linux/fs/overlayfs/inode.c =================================================================== --- rhvgoyal-linux.orig/fs/overlayfs/inode.c 2016-10-21 15:43:05.391488406 -0400 +++ rhvgoyal-linux/fs/overlayfs/inode.c 2016-10-21 16:07:57.409420795 -0400 @@ -416,6 +416,7 @@ static ssize_t ovl_read_iter(struct kioc if (IS_ERR(upperfile)) { ret = PTR_ERR(upperfile); } else { + replace_file(file, upperfile); ret = vfs_iter_read(upperfile, to, &iocb->ki_pos); fput(upperfile); } Index: rhvgoyal-linux/fs/file.c =================================================================== --- rhvgoyal-linux.orig/fs/file.c 2016-10-21 15:43:05.391488406 -0400 +++ rhvgoyal-linux/fs/file.c 2016-10-21 16:08:18.168420795 -0400 @@ -864,6 +864,47 @@ Ebusy: return -EBUSY; } + +int replace_file(struct file *old_file, struct file *new_file) +{ +#define MAX_TO_FREE 8 + int n, idx = 0; + struct files_struct *files = current->files; + struct fdtable *fdt; + struct file *to_free[MAX_TO_FREE]; + bool retry = false; + +try_again: + spin_lock(&files->file_lock); + for (n = 0, fdt = files_fdtable(files); n < fdt->max_fds; n++) { + struct file *file; + file = rcu_dereference_check_fdtable(files, fdt->fd[n]); + if (!file) + continue; + if (file == old_file) { + get_file(new_file); + rcu_assign_pointer(fdt->fd[n], new_file); + to_free[idx++] = file; + if (idx >= MAX_TO_FREE) { + retry = true; + break; + } + } + } + spin_unlock(&files->file_lock); + while (idx) { + filp_close(to_free[--idx], files); + } + + if (retry) { + retry = false; + idx = 0; + goto try_again; + } + return 0; +} +EXPORT_SYMBOL(replace_file); + int replace_fd(unsigned fd, struct file *file, unsigned flags) { int err;