Message ID | 1478817883-27662-5-git-send-email-amir73il@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Nov 11, 2016 at 12:44:43AM +0200, Amir Goldstein wrote: > If work dir and upper dir have a common parent, then it > is safe to move files between them without taking the super block > s_vfs_rename_mutex. > > Since both upper dir and work dir (as well as $workdir/work) > are delete locked on overlay mount, they cannot be moved. Do explain. What do you mean, "delete locked" and why exactly can't they be moved? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Nov 10, 2016 at 11:02:21PM +0000, Al Viro wrote: > On Fri, Nov 11, 2016 at 12:44:43AM +0200, Amir Goldstein wrote: > > If work dir and upper dir have a common parent, then it > > is safe to move files between them without taking the super block > > s_vfs_rename_mutex. > > > > Since both upper dir and work dir (as well as $workdir/work) > > are delete locked on overlay mount, they cannot be moved. > > Do explain. What do you mean, "delete locked" and why exactly can't they > be moved? More specifically, what makes you think that may_delete() is called before any change of parent? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 11, 2016 at 1:05 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Thu, Nov 10, 2016 at 11:02:21PM +0000, Al Viro wrote: >> On Fri, Nov 11, 2016 at 12:44:43AM +0200, Amir Goldstein wrote: >> > If work dir and upper dir have a common parent, then it >> > is safe to move files between them without taking the super block >> > s_vfs_rename_mutex. >> > >> > Since both upper dir and work dir (as well as $workdir/work) >> > are delete locked on overlay mount, they cannot be moved. >> >> Do explain. What do you mean, "delete locked" and why exactly can't they >> be moved? > That is explained in commit message of the first 2 patches, but foremost I would like to say that the name "delete locked" is just the best of all the bad choices for names I came up with, so I am expecting better name suggestions. The concept is to pin the entry so that it cannot be moved, so can have certain guaranties about the stability of the directories topology. > More specifically, what makes you think that may_delete() is called before > any change of parent? Because it is checked at the beginning of vfs_rename() and in order to change a parent, an entry needs to be renamed. no? What am I missing? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 11, 2016 at 01:11:16AM +0200, Amir Goldstein wrote: > That is explained in commit message of the first 2 patches, but > foremost I would like to say that the name "delete locked" is just the > best of all the bad choices for names I came up with, so I am expecting > better name suggestions. > > The concept is to pin the entry so that it cannot be moved, so can have > certain guaranties about the stability of the directories topology. > > > More specifically, what makes you think that may_delete() is called before > > any change of parent? > > Because it is checked at the beginning of vfs_rename() > and in order to change a parent, an entry needs to be renamed. no? > What am I missing? Analysis of the rest of call chains where we have __d_move() called? Not to mention anything else, workdir itself might be moved around by d_splice_alias() for a sufficiently unpleasant filesystem... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 11, 2016 at 1:17 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Fri, Nov 11, 2016 at 01:11:16AM +0200, Amir Goldstein wrote: > >> That is explained in commit message of the first 2 patches, but >> foremost I would like to say that the name "delete locked" is just the >> best of all the bad choices for names I came up with, so I am expecting >> better name suggestions. >> >> The concept is to pin the entry so that it cannot be moved, so can have >> certain guaranties about the stability of the directories topology. >> >> > More specifically, what makes you think that may_delete() is called before >> > any change of parent? >> >> Because it is checked at the beginning of vfs_rename() >> and in order to change a parent, an entry needs to be renamed. no? >> What am I missing? > > Analysis of the rest of call chains where we have __d_move() called? Not > to mention anything else, workdir itself might be moved around by > d_splice_alias() for a sufficiently unpleasant filesystem... I am certainly looking for this feedback, because there is no other means for me to sanity test if relaxing lock_rename() is safe. When I write "any change of parent" I mean a change between 2 different connected parents. Both work dir and upper dir are connected and with reference held after mount. Are the d_splice_alias() and __d_unalias() cases a real concern for moving work dir around after the overlay mount?? Plus, if a file system is sufficiently unpleasant, perhaps there is a way to categorize it as such by ovl_dentry_weird(), if it hasn't been categorized as such already. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 11, 2016 at 01:33:08AM +0200, Amir Goldstein wrote: > I am certainly looking for this feedback, because there is no other means for > me to sanity test if relaxing lock_rename() is safe. > > When I write "any change of parent" I mean a change between 2 different > connected parents. Both work dir and upper dir are connected and with > reference held after mount. > Are the d_splice_alias() and __d_unalias() cases a real concern for moving > work dir around after the overlay mount?? Why not? Again, it's really up to you to provide an analysis of the call chains. There's nothing in d_splice_alias() to prohibit an existing alias being attached - in fact, __d_unalias() is called exactly in that case. It's a rare case, all right, but it is not impossible. BTW, your analysis would better be simple and explicit - anything subtle will be flat-out rejected, since it would have to be stepped around very carefully in any later work in VFS. I really wonder what it is that you are getting contention on - what are you doing, besides the actual renames? And that needs serialization anyway (on inode lock of workdir, if nothing else), so any contention would not disappear from dropped ->s_vfs_rename_mutex... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 11, 2016 at 1:54 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Fri, Nov 11, 2016 at 01:33:08AM +0200, Amir Goldstein wrote: > >> I am certainly looking for this feedback, because there is no other means for >> me to sanity test if relaxing lock_rename() is safe. >> >> When I write "any change of parent" I mean a change between 2 different >> connected parents. Both work dir and upper dir are connected and with >> reference held after mount. >> Are the d_splice_alias() and __d_unalias() cases a real concern for moving >> work dir around after the overlay mount?? > > Why not? Again, it's really up to you to provide an analysis of the call > chains. There's nothing in d_splice_alias() to prohibit an existing alias > being attached - in fact, __d_unalias() is called exactly in that case. > It's a rare case, all right, but it is not impossible. > > BTW, your analysis would better be simple and explicit - anything subtle > will be flat-out rejected, since it would have to be stepped around very > carefully in any later work in VFS. > Sure, I get that. Wanted to post early, before I have a full proof that this is sane (if I am able to generate one), so people can shout at me - what the hell? soon enough. > I really wonder what it is that you are getting contention on - what are > you doing, besides the actual renames? And that needs serialization anyway > (on inode lock of workdir, if nothing else), so any contention would not > disappear from dropped ->s_vfs_rename_mutex... Yeah, I wrote in the cover letter that I did not generate performance numbers yet, which is a must for this sort of work, and that I am hoping to get some feedback from testers. But the serialization I am trying to avoid is between copy-ups and whiteouts of different overlay mounts, all on the same fs, which is the case with docker/rocket containers. Not ruling out that I am barking up the wrong tree. The burden of proof is on me. Thanks, Amir. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 11, 2016 at 02:11:56AM +0200, Amir Goldstein wrote: > Yeah, I wrote in the cover letter that I did not generate performance > numbers yet, > which is a must for this sort of work, and that I am hoping to get some feedback > from testers. > But the serialization I am trying to avoid is between copy-ups and whiteouts of > different overlay mounts, all on the same fs, which is the case with > docker/rocket > containers. > > Not ruling out that I am barking up the wrong tree. The burden of > proof is on me. Surely, the copying of data itself is outside of that lock, isn't it? And renames proper, especially if there is any kind of contention going on, will be on the metadata hot in cache, so I would really like to see the actual evidence of contention-related performance issues... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 11, 2016 at 2:27 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Fri, Nov 11, 2016 at 02:11:56AM +0200, Amir Goldstein wrote: >> Yeah, I wrote in the cover letter that I did not generate performance >> numbers yet, >> which is a must for this sort of work, and that I am hoping to get some feedback >> from testers. >> But the serialization I am trying to avoid is between copy-ups and whiteouts of >> different overlay mounts, all on the same fs, which is the case with >> docker/rocket >> containers. >> >> Not ruling out that I am barking up the wrong tree. The burden of >> proof is on me. > > Surely, the copying of data itself is outside of that lock, isn't it? Mmmmmm, no it isn't, but I am going to make it right. > And renames proper, especially if there is any kind of contention going on, > will be on the metadata hot in cache, so I would really like to see the > actual evidence of contention-related performance issues... I am afraid I won't find these evidence. I did a small benchmark of 2 parallel rm -rf processes on 2 different overlay mount over the same fs. In that test, every process spends 20% of the time on vfs_whiteout and 10% on vfs_rename (of those whiteouts). mutex_lock/unlock of s_vfs_rename_mutex takes 4% of the time, but the lock covers both the whiteout and rename. I estimate that after we fix the coarse grained rename_lock, it will be quite hard to demonstrate contention-related performance issues due to s_vfs_rename_mutex. Thanks for taking the time to set me straight. Amir. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 11, 2016 at 04:43:57PM +0200, Amir Goldstein wrote: > > Surely, the copying of data itself is outside of that lock, isn't it? > > Mmmmmm, no it isn't, but I am going to make it right. Well, if you really have the copyup itself done under lock_rename(), you are looking at potentially copying gigabytes of data from r/o layer to workdir under fs-wide lock. _That_ is really asking for contention from hell, and for no good reason. Renames/whiteout creations/removals are trivial noise compared to that. I'm somewhat surprised, TBH - I thought the copyup itself is done outside of that thing. Are you sure it really isn't? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 11, 2016 at 5:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Fri, Nov 11, 2016 at 04:43:57PM +0200, Amir Goldstein wrote: > >> > Surely, the copying of data itself is outside of that lock, isn't it? >> >> Mmmmmm, no it isn't, but I am going to make it right. > > Well, if you really have the copyup itself done under lock_rename(), you > are looking at potentially copying gigabytes of data from r/o layer to > workdir under fs-wide lock. _That_ is really asking for contention from > hell, and for no good reason. Renames/whiteout creations/removals are > trivial noise compared to that. I'm somewhat surprised, TBH - I thought > the copyup itself is done outside of that thing. Are you sure it really > isn't? I'm afraid so. It seems to me that most of the time, lock_rename() is being used in overlayfs for no better alternative to lock 2 directories (work+upper). My suggestion is a small modification to the overlayfs locking scheme. ---Instead of this: assert(lock_rename(workdir, upperdir) != NULL)); copy_up(src, tmp); vfs_rename(tmp, dst); unlock_rename(workdir, upperdir); +++Use this: assert(lock_rename(workdir, upperdir) != NULL)); mutex_unlock(s_vfs_rename_mutex); copy_up(src, tmp); inode_unlock(upperdir); inode_unlock(workdir); assert(lock_rename(workdir, upperdir) != NULL)); vfs_rename(tmp, dst); unlock_rename(workdir, upperdir); Miklos, Do you see any problem with the proposed scheme? Anything that can go wrong while releasing the workdir lock before vfs_rename()? Do you think we should do the same for the remove/create functions or leave the lock_rename() over whiteout+rename+cleanup as it is? Amir. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 11, 2016 at 06:17:07PM +0200, Amir Goldstein wrote: > I'm afraid so. > It seems to me that most of the time, lock_rename() is being used in > overlayfs for no better alternative to lock 2 directories (work+upper). > > My suggestion is a small modification to the overlayfs locking scheme. > ---Instead of this: > assert(lock_rename(workdir, upperdir) != NULL)); > copy_up(src, tmp); > vfs_rename(tmp, dst); > unlock_rename(workdir, upperdir); > > +++Use this: > assert(lock_rename(workdir, upperdir) != NULL)); > mutex_unlock(s_vfs_rename_mutex); > copy_up(src, tmp); > inode_unlock(upperdir); > inode_unlock(workdir); > assert(lock_rename(workdir, upperdir) != NULL)); > vfs_rename(tmp, dst); > unlock_rename(workdir, upperdir); > > Miklos, > > Do you see any problem with the proposed scheme? > Anything that can go wrong while releasing the workdir lock before vfs_rename()? Huh??? ->rename() definitely counts upon parents being locked; please, read the damn Documentation/filesystems/locking, it's there for a reason. The real question is why the fsck do you need to lock the workdir for the duration of copying at all. O_TMPFILE open + writes there doesn't need lock_rename() *or* parents being locked. You need the parent locked when you link the sucker in place, but that's it. IDGI... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 11, 2016 at 6:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Fri, Nov 11, 2016 at 06:17:07PM +0200, Amir Goldstein wrote: > >> I'm afraid so. >> It seems to me that most of the time, lock_rename() is being used in >> overlayfs for no better alternative to lock 2 directories (work+upper). >> >> My suggestion is a small modification to the overlayfs locking scheme. >> ---Instead of this: >> assert(lock_rename(workdir, upperdir) != NULL)); >> copy_up(src, tmp); >> vfs_rename(tmp, dst); >> unlock_rename(workdir, upperdir); >> >> +++Use this: >> assert(lock_rename(workdir, upperdir) != NULL)); >> mutex_unlock(s_vfs_rename_mutex); >> copy_up(src, tmp); >> inode_unlock(upperdir); >> inode_unlock(workdir); >> assert(lock_rename(workdir, upperdir) != NULL)); >> vfs_rename(tmp, dst); >> unlock_rename(workdir, upperdir); >> >> Miklos, >> >> Do you see any problem with the proposed scheme? >> Anything that can go wrong while releasing the workdir lock before vfs_rename()? > > Huh??? > > ->rename() definitely counts upon parents being locked; please, read the > damn Documentation/filesystems/locking, it's there for a reason. > > The real question is why the fsck do you need to lock the workdir for the > duration of copying at all. O_TMPFILE open + writes there doesn't need > lock_rename() *or* parents being locked. You need the parent locked when > you link the sucker in place, but that's it. IDGI... There's that. The other thing that lock_rename does is prevent multiple copy ups on the same file. Arguably it's an overkill, but a replacement needs to be added. Fact is, nobody ever reported this blatant performance bottleneck. Probably because copying up gigabyte files is not the usual use case for union filesystems... Thanks, Miklos -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 11, 2016 at 7:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: > On Fri, Nov 11, 2016 at 06:17:07PM +0200, Amir Goldstein wrote: > >> I'm afraid so. >> It seems to me that most of the time, lock_rename() is being used in >> overlayfs for no better alternative to lock 2 directories (work+upper). >> >> My suggestion is a small modification to the overlayfs locking scheme. >> ---Instead of this: >> assert(lock_rename(workdir, upperdir) != NULL)); >> copy_up(src, tmp); >> vfs_rename(tmp, dst); >> unlock_rename(workdir, upperdir); >> >> +++Use this: >> assert(lock_rename(workdir, upperdir) != NULL)); >> mutex_unlock(s_vfs_rename_mutex); >> copy_up(src, tmp); >> inode_unlock(upperdir); >> inode_unlock(workdir); >> assert(lock_rename(workdir, upperdir) != NULL)); >> vfs_rename(tmp, dst); >> unlock_rename(workdir, upperdir); >> >> Miklos, >> >> Do you see any problem with the proposed scheme? >> Anything that can go wrong while releasing the workdir lock before vfs_rename()? > > Huh??? > > ->rename() definitely counts upon parents being locked; please, read the > damn Documentation/filesystems/locking, it's there for a reason. > I know it does. lock_rename() before vfs_rename() re-aquires the parent locks. The reason I am unlocking them is to be able to re-take s_vfs_rename_mutex and 2 parent dirs in correct order. > The real question is why the fsck do you need to lock the workdir for the > duration of copying at all. O_TMPFILE open + writes there doesn't need > lock_rename() *or* parents being locked. You need the parent locked when > you link the sucker in place, but that's it. IDGI... I am starting to understand the reasons of how the current locking scheme for copy up became to be what it is. It stems from the fact the the parent upper dir is used as the synchronization object to avoid double copy up (and my suggested scheme does not preserve that). Then workdir needs to be locked before we link the temp file. And then we cannot lock upperdir->workdir in that order without holding s_vfs_rename_mutex while we do that, so the whole locking became lock_rename() over the whole thing. I can see the following solution: For regular file copy up, don't use workdir at all, use O_TMPFILE instead then only need to take upperdir lock and it can be the synchronization object for the entire copy-up, which also allows for concurrent copy up to different directories. For non-regular file copy up, use workdir as it is used now and take lock_rename() for the entire copy up as it is now. If upper fs does not support O_TMPFILE, we can resort to the contending copy-up. I am working on a RFC patch. Will send it out over weekend. Bare with me guys, I'll get it right eventually... Amir. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 11, 2016 at 8:07 PM, Amir Goldstein <amir73il@gmail.com> wrote: > On Fri, Nov 11, 2016 at 7:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: >> On Fri, Nov 11, 2016 at 06:17:07PM +0200, Amir Goldstein wrote: >> >>> I'm afraid so. >>> It seems to me that most of the time, lock_rename() is being used in >>> overlayfs for no better alternative to lock 2 directories (work+upper). >>> >> Huh??? >> >> ->rename() definitely counts upon parents being locked; please, read the >> damn Documentation/filesystems/locking, it's there for a reason. >> > > I know it does. lock_rename() before vfs_rename() re-aquires the parent locks. > The reason I am unlocking them is to be able to re-take s_vfs_rename_mutex > and 2 parent dirs in correct order. > >> The real question is why the fsck do you need to lock the workdir for the >> duration of copying at all. O_TMPFILE open + writes there doesn't need >> lock_rename() *or* parents being locked. You need the parent locked when >> you link the sucker in place, but that's it. IDGI... > > I am starting to understand the reasons of how the current locking scheme for > copy up became to be what it is. > > It stems from the fact the the parent upper dir is used as the synchronization > object to avoid double copy up (and my suggested scheme does not preserve > that). Then workdir needs to be locked before we link the temp file. > And then we cannot lock upperdir->workdir in that order without holding > s_vfs_rename_mutex while we do that, so the whole locking became > lock_rename() over the whole thing. > > I can see the following solution: > > For regular file copy up, don't use workdir at all, use O_TMPFILE instead > then only need to take upperdir lock and it can be the synchronization > object for the entire copy-up, which also allows for concurrent copy up > to different directories. > > For non-regular file copy up, use workdir as it is used now and take > lock_rename() for the entire copy up as it is now. > > If upper fs does not support O_TMPFILE, we can resort to the > contending copy-up. > > I am working on a RFC patch. Will send it out over weekend. > Bare with me guys, I'll get it right eventually... > I posted the new patches. They are much simpler than my previous attempt. I did not go for the O_TMPFILE solution because I found a solution that IMO is simple and clean without changing too much code. In a nut shell, the overlay inode is used to synchronize copy up regular files (outside the existing locks), so lock_rename can be dropped for the period of copying data and re-acquired before the final rename. Amir. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 11, 2016 at 7:44 PM, Miklos Szeredi <miklos@szeredi.hu> wrote: > On Fri, Nov 11, 2016 at 6:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote: >> On Fri, Nov 11, 2016 at 06:17:07PM +0200, Amir Goldstein wrote: >> >>> I'm afraid so. >>> It seems to me that most of the time, lock_rename() is being used in >>> overlayfs for no better alternative to lock 2 directories (work+upper). >>> >>> My suggestion is a small modification to the overlayfs locking scheme. >>> ---Instead of this: >>> assert(lock_rename(workdir, upperdir) != NULL)); >>> copy_up(src, tmp); >>> vfs_rename(tmp, dst); >>> unlock_rename(workdir, upperdir); >>> >>> +++Use this: >>> assert(lock_rename(workdir, upperdir) != NULL)); >>> mutex_unlock(s_vfs_rename_mutex); >>> copy_up(src, tmp); >>> inode_unlock(upperdir); >>> inode_unlock(workdir); >>> assert(lock_rename(workdir, upperdir) != NULL)); >>> vfs_rename(tmp, dst); >>> unlock_rename(workdir, upperdir); >>> >>> Miklos, >>> >>> Do you see any problem with the proposed scheme? >>> Anything that can go wrong while releasing the workdir lock before vfs_rename()? >> >> Huh??? >> >> ->rename() definitely counts upon parents being locked; please, read the >> damn Documentation/filesystems/locking, it's there for a reason. >> >> The real question is why the fsck do you need to lock the workdir for the >> duration of copying at all. O_TMPFILE open + writes there doesn't need >> lock_rename() *or* parents being locked. You need the parent locked when >> you link the sucker in place, but that's it. IDGI... > > There's that. The other thing that lock_rename does is prevent > multiple copy ups on the same file. Arguably it's an overkill, but a > replacement needs to be added. > Miklos, FYI, I started to work on copy up using O_TMPFILE with i_mutex held only for upper dir. > There's that. The other thing that lock_rename does is prevent > multiple copy ups on the same file. Arguably it's an overkill, but a > replacement needs to be added. > I think you suggested here to use a waitqueue for pending copyups? Did you mean a waitqueue per overlay parent directory inode or did you mean something else? > Fact is, nobody ever reported this blatant performance bottleneck. > Probably because copying up gigabyte files is not the usual use case > for union filesystems... Still, it's a loophole for some serious DoS. An unprivileged user inside container can touch a file and block all copy up on all other containers on that host and all renames on that host fs as well! Amir. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/namei.c b/fs/namei.c index 6040d1e..8e3be1b 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2867,6 +2867,26 @@ void unlock_rename(struct dentry *p1, struct dentry *p2) } EXPORT_SYMBOL(unlock_rename); +/* + * p1 and p2 are delete locked and so are their parents, all the way + * to a common ancestor and they are not decendant of each other, + * so it is safe to move children between them without holding + * s_vfs_rename_mutex and it is safe to lock them in any order. + */ +void lock_rename_safe(struct dentry *p1, struct dentry *p2) +{ + inode_lock_nested(p1->d_inode, I_MUTEX_PARENT); + inode_lock_nested(p2->d_inode, I_MUTEX_PARENT2); +} +EXPORT_SYMBOL(lock_rename_safe); + +void unlock_rename_safe(struct dentry *p1, struct dentry *p2) +{ + inode_unlock(p1->d_inode); + inode_unlock(p2->d_inode); +} +EXPORT_SYMBOL(unlock_rename_safe); + int vfs_create(struct inode *dir, struct dentry *dentry, umode_t mode, bool want_excl) { diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c index f18c1a6..02b5386 100644 --- a/fs/overlayfs/copy_up.c +++ b/fs/overlayfs/copy_up.c @@ -367,9 +367,8 @@ int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry, return PTR_ERR(link); } - err = -EIO; - if (lock_rename(workdir, upperdir) != NULL) { - pr_err("overlayfs: failed to lock workdir+upperdir\n"); + err = ovl_lock_rename_workdir(workdir, upperdir); + if (err) { goto out_unlock; } upperdentry = ovl_dentry_upper(dentry); @@ -386,7 +385,7 @@ int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry, ovl_set_timestamps(upperdir, &pstat); } out_unlock: - unlock_rename(workdir, upperdir); + ovl_unlock_rename_workdir(workdir, upperdir); do_delayed_call(&done); return err; diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c index f24b6b9..f0e9b8f 100644 --- a/fs/overlayfs/dir.c +++ b/fs/overlayfs/dir.c @@ -202,15 +202,16 @@ static int ovl_create_upper(struct dentry *dentry, struct inode *inode, return err; } -static int ovl_lock_rename_workdir(struct dentry *workdir, - struct dentry *upperdir) +int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir) { /* Workdir should not be the same as upperdir */ if (workdir == upperdir) goto err; - /* Workdir should not be subdir of upperdir and vice versa */ - if (lock_rename(workdir, upperdir) != NULL) + /* FIXME: check this once in fill_super */ + if (upperdir->d_parent == workdir->d_parent->d_parent) + lock_rename_safe(workdir, upperdir); + else if (lock_rename(workdir, upperdir) != NULL) goto err_unlock; return 0; @@ -222,6 +223,14 @@ static int ovl_lock_rename_workdir(struct dentry *workdir, return -EIO; } +void ovl_unlock_rename_workdir(struct dentry *workdir, struct dentry *upperdir) +{ + if (upperdir->d_parent == workdir->d_parent->d_parent) + unlock_rename_safe(workdir, upperdir); + else + unlock_rename(workdir, upperdir); +} + static struct dentry *ovl_clear_empty(struct dentry *dentry, struct list_head *list) { @@ -283,7 +292,7 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry, ovl_cleanup_whiteouts(upper, list); ovl_cleanup(wdir, upper); - unlock_rename(workdir, upperdir); + ovl_unlock_rename_workdir(workdir, upperdir); /* dentry's upper doesn't match now, get rid of it */ d_drop(dentry); @@ -295,7 +304,7 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry, out_dput: dput(opaquedir); out_unlock: - unlock_rename(workdir, upperdir); + ovl_unlock_rename_workdir(workdir, upperdir); out: return ERR_PTR(err); } @@ -450,7 +459,7 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode, out_dput: dput(newdentry); out_unlock: - unlock_rename(workdir, upperdir); + ovl_unlock_rename_workdir(workdir, upperdir); out: if (!hardlink) { posix_acl_release(acl); @@ -657,7 +666,7 @@ static int ovl_remove_and_whiteout(struct dentry *dentry, bool is_dir) out_dput_upper: dput(upper); out_unlock: - unlock_rename(workdir, upperdir); + ovl_unlock_rename_workdir(workdir, upperdir); out_dput: dput(opaquedir); out: diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h index f6e4d35..63ca647 100644 --- a/fs/overlayfs/overlayfs.h +++ b/fs/overlayfs/overlayfs.h @@ -211,6 +211,8 @@ int ovl_create_real(struct inode *dir, struct dentry *newdentry, struct kstat *stat, const char *link, struct dentry *hardlink, bool debug); void ovl_cleanup(struct inode *dir, struct dentry *dentry); +int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir); +void ovl_unlock_rename_workdir(struct dentry *workdir, struct dentry *upperdir); /* copy_up.c */ int ovl_copy_up(struct dentry *dentry); diff --git a/include/linux/namei.h b/include/linux/namei.h index f29abda..31e3f4f 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -88,6 +88,8 @@ extern int follow_up(struct path *); extern struct dentry *lock_rename(struct dentry *, struct dentry *); extern void unlock_rename(struct dentry *, struct dentry *); +extern void lock_rename_safe(struct dentry *, struct dentry *); +extern void unlock_rename_safe(struct dentry *, struct dentry *); extern void nd_jump_link(struct path *path);
If work dir and upper dir have a common parent, then it is safe to move files between them without taking the super block s_vfs_rename_mutex. Since both upper dir and work dir (as well as $workdir/work) are delete locked on overlay mount, they cannot be moved. Therefore, if upper and work dir have the same parent at mount time, this cannot change thougout the time that overlay is mounted. As a result, moving files between work dir and upper dir cannot change the tree topology and cannot result in loops. For the same reason, it is safe to take the inode locks on work+upper before moving files, in any consistent order, as they cannot be descendants of each other. Implement a pair of helpers {unlock,lock}_rename_safe() to be used instead of the full blown {unlock,lock}_rename() pair in the case of work/upper dir having a common parent. Signed-off-by: Amir Goldstein <amir73il@gmail.com> --- fs/namei.c | 20 ++++++++++++++++++++ fs/overlayfs/copy_up.c | 7 +++---- fs/overlayfs/dir.c | 25 +++++++++++++++++-------- fs/overlayfs/overlayfs.h | 2 ++ include/linux/namei.h | 2 ++ 5 files changed, 44 insertions(+), 12 deletions(-)