[RFC,PATH,4/4] ovl: relax lock_rename when moving files between work and upper dir
diff mbox

Message ID 1478817883-27662-5-git-send-email-amir73il@gmail.com
State New
Headers show

Commit Message

Amir Goldstein Nov. 10, 2016, 10:44 p.m. UTC
If work dir and upper dir have a common parent, then it
is safe to move files between them without taking the super block
s_vfs_rename_mutex.

Since both upper dir and work dir (as well as $workdir/work)
are delete locked on overlay mount, they cannot be moved.
Therefore, if upper and work dir have the same parent at mount
time, this cannot change thougout the time that overlay is mounted.

As a result, moving files between work dir and upper dir cannot
change the tree topology and cannot result in loops. For the same
reason, it is safe to take the inode locks on work+upper before
moving files, in any consistent order, as they cannot be descendants
of each other.

Implement a pair of helpers {unlock,lock}_rename_safe() to be used
instead of the full blown {unlock,lock}_rename() pair in the case of
work/upper dir having a common parent.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/namei.c               | 20 ++++++++++++++++++++
 fs/overlayfs/copy_up.c   |  7 +++----
 fs/overlayfs/dir.c       | 25 +++++++++++++++++--------
 fs/overlayfs/overlayfs.h |  2 ++
 include/linux/namei.h    |  2 ++
 5 files changed, 44 insertions(+), 12 deletions(-)

Comments

Al Viro Nov. 10, 2016, 11:02 p.m. UTC | #1
On Fri, Nov 11, 2016 at 12:44:43AM +0200, Amir Goldstein wrote:
> If work dir and upper dir have a common parent, then it
> is safe to move files between them without taking the super block
> s_vfs_rename_mutex.
> 
> Since both upper dir and work dir (as well as $workdir/work)
> are delete locked on overlay mount, they cannot be moved.

Do explain.  What do you mean, "delete locked" and why exactly can't they
be moved?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Nov. 10, 2016, 11:05 p.m. UTC | #2
On Thu, Nov 10, 2016 at 11:02:21PM +0000, Al Viro wrote:
> On Fri, Nov 11, 2016 at 12:44:43AM +0200, Amir Goldstein wrote:
> > If work dir and upper dir have a common parent, then it
> > is safe to move files between them without taking the super block
> > s_vfs_rename_mutex.
> > 
> > Since both upper dir and work dir (as well as $workdir/work)
> > are delete locked on overlay mount, they cannot be moved.
> 
> Do explain.  What do you mean, "delete locked" and why exactly can't they
> be moved?

More specifically, what makes you think that may_delete() is called before
any change of parent?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Nov. 10, 2016, 11:11 p.m. UTC | #3
On Fri, Nov 11, 2016 at 1:05 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Nov 10, 2016 at 11:02:21PM +0000, Al Viro wrote:
>> On Fri, Nov 11, 2016 at 12:44:43AM +0200, Amir Goldstein wrote:
>> > If work dir and upper dir have a common parent, then it
>> > is safe to move files between them without taking the super block
>> > s_vfs_rename_mutex.
>> >
>> > Since both upper dir and work dir (as well as $workdir/work)
>> > are delete locked on overlay mount, they cannot be moved.
>>
>> Do explain.  What do you mean, "delete locked" and why exactly can't they
>> be moved?
>

That is explained in commit message of the first 2 patches, but
foremost I would like to say that the name "delete locked" is just the
best of all the bad choices for names I came up with, so I am expecting
better name suggestions.

The concept is to pin the entry so that it cannot be moved, so can have
certain guaranties about the stability of the directories topology.

> More specifically, what makes you think that may_delete() is called before
> any change of parent?

Because it is checked at the beginning of vfs_rename()
and in order to change a parent, an entry needs to be renamed. no?
What am I missing?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Nov. 10, 2016, 11:17 p.m. UTC | #4
On Fri, Nov 11, 2016 at 01:11:16AM +0200, Amir Goldstein wrote:

> That is explained in commit message of the first 2 patches, but
> foremost I would like to say that the name "delete locked" is just the
> best of all the bad choices for names I came up with, so I am expecting
> better name suggestions.
> 
> The concept is to pin the entry so that it cannot be moved, so can have
> certain guaranties about the stability of the directories topology.
> 
> > More specifically, what makes you think that may_delete() is called before
> > any change of parent?
> 
> Because it is checked at the beginning of vfs_rename()
> and in order to change a parent, an entry needs to be renamed. no?
> What am I missing?

Analysis of the rest of call chains where we have __d_move() called?  Not
to mention anything else, workdir itself might be moved around by
d_splice_alias() for a sufficiently unpleasant filesystem...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Nov. 10, 2016, 11:33 p.m. UTC | #5
On Fri, Nov 11, 2016 at 1:17 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Nov 11, 2016 at 01:11:16AM +0200, Amir Goldstein wrote:
>
>> That is explained in commit message of the first 2 patches, but
>> foremost I would like to say that the name "delete locked" is just the
>> best of all the bad choices for names I came up with, so I am expecting
>> better name suggestions.
>>
>> The concept is to pin the entry so that it cannot be moved, so can have
>> certain guaranties about the stability of the directories topology.
>>
>> > More specifically, what makes you think that may_delete() is called before
>> > any change of parent?
>>
>> Because it is checked at the beginning of vfs_rename()
>> and in order to change a parent, an entry needs to be renamed. no?
>> What am I missing?
>
> Analysis of the rest of call chains where we have __d_move() called?  Not
> to mention anything else, workdir itself might be moved around by
> d_splice_alias() for a sufficiently unpleasant filesystem...

I am certainly looking for this feedback, because there is no other means for
me to sanity test if relaxing lock_rename() is safe.

When I write "any change of parent" I mean a change between 2 different
connected parents. Both work dir and upper dir are connected and with
reference held after mount.
Are the d_splice_alias() and __d_unalias() cases a real concern for moving
work dir around after the overlay mount??

Plus, if a file system is sufficiently unpleasant, perhaps there is a way to
categorize it as such by ovl_dentry_weird(), if it hasn't been
categorized as such already.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Nov. 10, 2016, 11:54 p.m. UTC | #6
On Fri, Nov 11, 2016 at 01:33:08AM +0200, Amir Goldstein wrote:

> I am certainly looking for this feedback, because there is no other means for
> me to sanity test if relaxing lock_rename() is safe.
> 
> When I write "any change of parent" I mean a change between 2 different
> connected parents. Both work dir and upper dir are connected and with
> reference held after mount.
> Are the d_splice_alias() and __d_unalias() cases a real concern for moving
> work dir around after the overlay mount??

Why not?  Again, it's really up to you to provide an analysis of the call
chains.  There's nothing in d_splice_alias() to prohibit an existing alias
being attached - in fact, __d_unalias() is called exactly in that case.
It's a rare case, all right, but it is not impossible.

BTW, your analysis would better be simple and explicit - anything subtle
will be flat-out rejected, since it would have to be stepped around very
carefully in any later work in VFS.

I really wonder what it is that you are getting contention on - what are
you doing, besides the actual renames?  And that needs serialization anyway
(on inode lock of workdir, if nothing else), so any contention would not
disappear from dropped ->s_vfs_rename_mutex...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Nov. 11, 2016, 12:11 a.m. UTC | #7
On Fri, Nov 11, 2016 at 1:54 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Nov 11, 2016 at 01:33:08AM +0200, Amir Goldstein wrote:
>
>> I am certainly looking for this feedback, because there is no other means for
>> me to sanity test if relaxing lock_rename() is safe.
>>
>> When I write "any change of parent" I mean a change between 2 different
>> connected parents. Both work dir and upper dir are connected and with
>> reference held after mount.
>> Are the d_splice_alias() and __d_unalias() cases a real concern for moving
>> work dir around after the overlay mount??
>
> Why not?  Again, it's really up to you to provide an analysis of the call
> chains.  There's nothing in d_splice_alias() to prohibit an existing alias
> being attached - in fact, __d_unalias() is called exactly in that case.
> It's a rare case, all right, but it is not impossible.
>
> BTW, your analysis would better be simple and explicit - anything subtle
> will be flat-out rejected, since it would have to be stepped around very
> carefully in any later work in VFS.
>

Sure, I get that.
Wanted to post early, before I have a full proof that this is sane
(if I am able to generate one), so people can shout at me - what the hell?
soon enough.

> I really wonder what it is that you are getting contention on - what are
> you doing, besides the actual renames?  And that needs serialization anyway
> (on inode lock of workdir, if nothing else), so any contention would not
> disappear from dropped ->s_vfs_rename_mutex...

Yeah, I wrote in the cover letter that I did not generate performance
numbers yet,
which is a must for this sort of work, and that I am hoping to get some feedback
from testers.
But the serialization I am trying to avoid is between copy-ups and whiteouts of
different overlay mounts, all on the same fs, which is the case with
docker/rocket
containers.

Not ruling out that I am barking up the wrong tree. The burden of
proof is on me.

Thanks,
Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Nov. 11, 2016, 12:27 a.m. UTC | #8
On Fri, Nov 11, 2016 at 02:11:56AM +0200, Amir Goldstein wrote:
> Yeah, I wrote in the cover letter that I did not generate performance
> numbers yet,
> which is a must for this sort of work, and that I am hoping to get some feedback
> from testers.
> But the serialization I am trying to avoid is between copy-ups and whiteouts of
> different overlay mounts, all on the same fs, which is the case with
> docker/rocket
> containers.
> 
> Not ruling out that I am barking up the wrong tree. The burden of
> proof is on me.

Surely, the copying of data itself is outside of that lock, isn't it?
And renames proper, especially if there is any kind of contention going on,
will be on the metadata hot in cache, so I would really like to see the
actual evidence of contention-related performance issues...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Nov. 11, 2016, 2:43 p.m. UTC | #9
On Fri, Nov 11, 2016 at 2:27 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Nov 11, 2016 at 02:11:56AM +0200, Amir Goldstein wrote:
>> Yeah, I wrote in the cover letter that I did not generate performance
>> numbers yet,
>> which is a must for this sort of work, and that I am hoping to get some feedback
>> from testers.
>> But the serialization I am trying to avoid is between copy-ups and whiteouts of
>> different overlay mounts, all on the same fs, which is the case with
>> docker/rocket
>> containers.
>>
>> Not ruling out that I am barking up the wrong tree. The burden of
>> proof is on me.
>
> Surely, the copying of data itself is outside of that lock, isn't it?

Mmmmmm, no it isn't, but I am going to make it right.

> And renames proper, especially if there is any kind of contention going on,
> will be on the metadata hot in cache, so I would really like to see the
> actual evidence of contention-related performance issues...

I am afraid I won't find these evidence.
I did a small benchmark of 2 parallel rm -rf processes on 2 different
overlay mount
over the same fs.
In that test, every process spends 20% of the time on vfs_whiteout and 10% on
vfs_rename (of those whiteouts).  mutex_lock/unlock of s_vfs_rename_mutex
takes 4% of the time, but the lock covers both the whiteout and rename.

I estimate that after we fix the coarse grained rename_lock, it will be quite
hard to demonstrate contention-related performance issues due to
s_vfs_rename_mutex.

Thanks for taking the time to set me straight.
Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Nov. 11, 2016, 3:40 p.m. UTC | #10
On Fri, Nov 11, 2016 at 04:43:57PM +0200, Amir Goldstein wrote:

> > Surely, the copying of data itself is outside of that lock, isn't it?
> 
> Mmmmmm, no it isn't, but I am going to make it right.

Well, if you really have the copyup itself done under lock_rename(), you
are looking at potentially copying gigabytes of data from r/o layer to
workdir under fs-wide lock.  _That_ is really asking for contention from
hell, and for no good reason.  Renames/whiteout creations/removals are
trivial noise compared to that.  I'm somewhat surprised, TBH - I thought
the copyup itself is done outside of that thing.  Are you sure it really
isn't?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Nov. 11, 2016, 4:17 p.m. UTC | #11
On Fri, Nov 11, 2016 at 5:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Nov 11, 2016 at 04:43:57PM +0200, Amir Goldstein wrote:
>
>> > Surely, the copying of data itself is outside of that lock, isn't it?
>>
>> Mmmmmm, no it isn't, but I am going to make it right.
>
> Well, if you really have the copyup itself done under lock_rename(), you
> are looking at potentially copying gigabytes of data from r/o layer to
> workdir under fs-wide lock.  _That_ is really asking for contention from
> hell, and for no good reason.  Renames/whiteout creations/removals are
> trivial noise compared to that.  I'm somewhat surprised, TBH - I thought
> the copyup itself is done outside of that thing.  Are you sure it really
> isn't?

I'm afraid so.
It seems to me that most of the time, lock_rename() is being used in
overlayfs for no better alternative to lock 2 directories (work+upper).

My suggestion is a small modification to the overlayfs locking scheme.
---Instead of this:
assert(lock_rename(workdir, upperdir) != NULL));
   copy_up(src, tmp);
   vfs_rename(tmp, dst);
unlock_rename(workdir, upperdir);

+++Use this:
assert(lock_rename(workdir, upperdir) != NULL));
mutex_unlock(s_vfs_rename_mutex);
   copy_up(src, tmp);
inode_unlock(upperdir);
inode_unlock(workdir);
assert(lock_rename(workdir, upperdir) != NULL));
  vfs_rename(tmp, dst);
unlock_rename(workdir, upperdir);

Miklos,

Do you see any problem with the proposed scheme?
Anything that can go wrong while releasing the workdir lock before vfs_rename()?

Do you think we should do the same for the remove/create functions
or leave the lock_rename() over whiteout+rename+cleanup as it is?

Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro Nov. 11, 2016, 5:27 p.m. UTC | #12
On Fri, Nov 11, 2016 at 06:17:07PM +0200, Amir Goldstein wrote:

> I'm afraid so.
> It seems to me that most of the time, lock_rename() is being used in
> overlayfs for no better alternative to lock 2 directories (work+upper).
> 
> My suggestion is a small modification to the overlayfs locking scheme.
> ---Instead of this:
> assert(lock_rename(workdir, upperdir) != NULL));
>    copy_up(src, tmp);
>    vfs_rename(tmp, dst);
> unlock_rename(workdir, upperdir);
> 
> +++Use this:
> assert(lock_rename(workdir, upperdir) != NULL));
> mutex_unlock(s_vfs_rename_mutex);
>    copy_up(src, tmp);
> inode_unlock(upperdir);
> inode_unlock(workdir);
> assert(lock_rename(workdir, upperdir) != NULL));
>   vfs_rename(tmp, dst);
> unlock_rename(workdir, upperdir);
> 
> Miklos,
> 
> Do you see any problem with the proposed scheme?
> Anything that can go wrong while releasing the workdir lock before vfs_rename()?

Huh???

->rename() definitely counts upon parents being locked; please, read the
damn Documentation/filesystems/locking, it's there for a reason.

The real question is why the fsck do you need to lock the workdir for the
duration of copying at all.  O_TMPFILE open + writes there doesn't need
lock_rename() *or* parents being locked.  You need the parent locked when
you link the sucker in place, but that's it.  IDGI...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Miklos Szeredi Nov. 11, 2016, 5:44 p.m. UTC | #13
On Fri, Nov 11, 2016 at 6:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Nov 11, 2016 at 06:17:07PM +0200, Amir Goldstein wrote:
>
>> I'm afraid so.
>> It seems to me that most of the time, lock_rename() is being used in
>> overlayfs for no better alternative to lock 2 directories (work+upper).
>>
>> My suggestion is a small modification to the overlayfs locking scheme.
>> ---Instead of this:
>> assert(lock_rename(workdir, upperdir) != NULL));
>>    copy_up(src, tmp);
>>    vfs_rename(tmp, dst);
>> unlock_rename(workdir, upperdir);
>>
>> +++Use this:
>> assert(lock_rename(workdir, upperdir) != NULL));
>> mutex_unlock(s_vfs_rename_mutex);
>>    copy_up(src, tmp);
>> inode_unlock(upperdir);
>> inode_unlock(workdir);
>> assert(lock_rename(workdir, upperdir) != NULL));
>>   vfs_rename(tmp, dst);
>> unlock_rename(workdir, upperdir);
>>
>> Miklos,
>>
>> Do you see any problem with the proposed scheme?
>> Anything that can go wrong while releasing the workdir lock before vfs_rename()?
>
> Huh???
>
> ->rename() definitely counts upon parents being locked; please, read the
> damn Documentation/filesystems/locking, it's there for a reason.
>
> The real question is why the fsck do you need to lock the workdir for the
> duration of copying at all.  O_TMPFILE open + writes there doesn't need
> lock_rename() *or* parents being locked.  You need the parent locked when
> you link the sucker in place, but that's it.  IDGI...

There's that.  The other thing that lock_rename does is prevent
multiple copy ups on the same file.   Arguably it's an overkill, but a
replacement needs to be added.

Fact is, nobody ever reported this blatant performance bottleneck.
Probably because copying up gigabyte files is not the usual use case
for union filesystems...

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Nov. 11, 2016, 6:07 p.m. UTC | #14
On Fri, Nov 11, 2016 at 7:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Nov 11, 2016 at 06:17:07PM +0200, Amir Goldstein wrote:
>
>> I'm afraid so.
>> It seems to me that most of the time, lock_rename() is being used in
>> overlayfs for no better alternative to lock 2 directories (work+upper).
>>
>> My suggestion is a small modification to the overlayfs locking scheme.
>> ---Instead of this:
>> assert(lock_rename(workdir, upperdir) != NULL));
>>    copy_up(src, tmp);
>>    vfs_rename(tmp, dst);
>> unlock_rename(workdir, upperdir);
>>
>> +++Use this:
>> assert(lock_rename(workdir, upperdir) != NULL));
>> mutex_unlock(s_vfs_rename_mutex);
>>    copy_up(src, tmp);
>> inode_unlock(upperdir);
>> inode_unlock(workdir);
>> assert(lock_rename(workdir, upperdir) != NULL));
>>   vfs_rename(tmp, dst);
>> unlock_rename(workdir, upperdir);
>>
>> Miklos,
>>
>> Do you see any problem with the proposed scheme?
>> Anything that can go wrong while releasing the workdir lock before vfs_rename()?
>
> Huh???
>
> ->rename() definitely counts upon parents being locked; please, read the
> damn Documentation/filesystems/locking, it's there for a reason.
>

I know it does. lock_rename() before vfs_rename() re-aquires the parent locks.
The reason I am unlocking them is to be able to re-take s_vfs_rename_mutex
and 2 parent dirs in correct order.

> The real question is why the fsck do you need to lock the workdir for the
> duration of copying at all.  O_TMPFILE open + writes there doesn't need
> lock_rename() *or* parents being locked.  You need the parent locked when
> you link the sucker in place, but that's it.  IDGI...

I am starting to understand the reasons of how the current locking scheme for
copy up became to be what it is.

It stems from the fact the the parent upper dir is used as the synchronization
object to avoid double copy up (and my suggested scheme does not preserve
that). Then workdir needs to be locked before we link the temp file.
And then we cannot lock upperdir->workdir in that order without holding
s_vfs_rename_mutex while we do that, so the whole locking became
lock_rename() over the whole thing.

I can see the following solution:

For regular file copy up, don't use workdir at all, use O_TMPFILE instead
then only need to take upperdir lock and it can be the synchronization
object for the entire copy-up, which also allows for concurrent copy up
to different directories.

For non-regular file copy up, use workdir as it is used now and take
lock_rename() for the entire copy up as it is now.

If upper fs does not support O_TMPFILE, we can resort to the
contending copy-up.

I am working on a RFC patch. Will send it out over weekend.
Bare with me guys, I'll get it right eventually...

Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Nov. 12, 2016, 7:45 p.m. UTC | #15
On Fri, Nov 11, 2016 at 8:07 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Fri, Nov 11, 2016 at 7:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Nov 11, 2016 at 06:17:07PM +0200, Amir Goldstein wrote:
>>
>>> I'm afraid so.
>>> It seems to me that most of the time, lock_rename() is being used in
>>> overlayfs for no better alternative to lock 2 directories (work+upper).
>>>
>> Huh???
>>
>> ->rename() definitely counts upon parents being locked; please, read the
>> damn Documentation/filesystems/locking, it's there for a reason.
>>
>
> I know it does. lock_rename() before vfs_rename() re-aquires the parent locks.
> The reason I am unlocking them is to be able to re-take s_vfs_rename_mutex
> and 2 parent dirs in correct order.
>
>> The real question is why the fsck do you need to lock the workdir for the
>> duration of copying at all.  O_TMPFILE open + writes there doesn't need
>> lock_rename() *or* parents being locked.  You need the parent locked when
>> you link the sucker in place, but that's it.  IDGI...
>
> I am starting to understand the reasons of how the current locking scheme for
> copy up became to be what it is.
>
> It stems from the fact the the parent upper dir is used as the synchronization
> object to avoid double copy up (and my suggested scheme does not preserve
> that). Then workdir needs to be locked before we link the temp file.
> And then we cannot lock upperdir->workdir in that order without holding
> s_vfs_rename_mutex while we do that, so the whole locking became
> lock_rename() over the whole thing.
>
> I can see the following solution:
>
> For regular file copy up, don't use workdir at all, use O_TMPFILE instead
> then only need to take upperdir lock and it can be the synchronization
> object for the entire copy-up, which also allows for concurrent copy up
> to different directories.
>
> For non-regular file copy up, use workdir as it is used now and take
> lock_rename() for the entire copy up as it is now.
>
> If upper fs does not support O_TMPFILE, we can resort to the
> contending copy-up.
>
> I am working on a RFC patch. Will send it out over weekend.
> Bare with me guys, I'll get it right eventually...
>

I posted the new patches. They are much simpler than my previous attempt.
I did not go for the O_TMPFILE solution because I found a solution
that IMO is simple and clean without changing too much code.

In a nut shell, the overlay inode is used to synchronize copy up
regular files (outside the existing locks), so lock_rename can be dropped
for the period of copying data and re-acquired before the final rename.

Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Amir Goldstein Jan. 12, 2017, 5:42 a.m. UTC | #16
On Fri, Nov 11, 2016 at 7:44 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Fri, Nov 11, 2016 at 6:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Nov 11, 2016 at 06:17:07PM +0200, Amir Goldstein wrote:
>>
>>> I'm afraid so.
>>> It seems to me that most of the time, lock_rename() is being used in
>>> overlayfs for no better alternative to lock 2 directories (work+upper).
>>>
>>> My suggestion is a small modification to the overlayfs locking scheme.
>>> ---Instead of this:
>>> assert(lock_rename(workdir, upperdir) != NULL));
>>>    copy_up(src, tmp);
>>>    vfs_rename(tmp, dst);
>>> unlock_rename(workdir, upperdir);
>>>
>>> +++Use this:
>>> assert(lock_rename(workdir, upperdir) != NULL));
>>> mutex_unlock(s_vfs_rename_mutex);
>>>    copy_up(src, tmp);
>>> inode_unlock(upperdir);
>>> inode_unlock(workdir);
>>> assert(lock_rename(workdir, upperdir) != NULL));
>>>   vfs_rename(tmp, dst);
>>> unlock_rename(workdir, upperdir);
>>>
>>> Miklos,
>>>
>>> Do you see any problem with the proposed scheme?
>>> Anything that can go wrong while releasing the workdir lock before vfs_rename()?
>>
>> Huh???
>>
>> ->rename() definitely counts upon parents being locked; please, read the
>> damn Documentation/filesystems/locking, it's there for a reason.
>>
>> The real question is why the fsck do you need to lock the workdir for the
>> duration of copying at all.  O_TMPFILE open + writes there doesn't need
>> lock_rename() *or* parents being locked.  You need the parent locked when
>> you link the sucker in place, but that's it.  IDGI...
>
> There's that.  The other thing that lock_rename does is prevent
> multiple copy ups on the same file.   Arguably it's an overkill, but a
> replacement needs to be added.
>

Miklos,

FYI, I started to work on copy up using O_TMPFILE with i_mutex
held only for upper dir.

> There's that.  The other thing that lock_rename does is prevent
> multiple copy ups on the same file.   Arguably it's an overkill, but a
> replacement needs to be added.
>

I think you suggested here to use a waitqueue for pending copyups?
Did you mean a waitqueue per overlay parent directory inode or
did you mean something else?

> Fact is, nobody ever reported this blatant performance bottleneck.
> Probably because copying up gigabyte files is not the usual use case
> for union filesystems...

Still, it's a loophole for some serious DoS. An unprivileged user inside
container can touch a file and block all copy up on all other containers
on that host and all renames on that host fs as well!

Amir.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/namei.c b/fs/namei.c
index 6040d1e..8e3be1b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2867,6 +2867,26 @@  void unlock_rename(struct dentry *p1, struct dentry *p2)
 }
 EXPORT_SYMBOL(unlock_rename);
 
+/*
+ * p1 and p2 are delete locked and so are their parents, all the way
+ * to a common ancestor and they are not decendant of each other,
+ * so it is safe to move children between them without holding
+ * s_vfs_rename_mutex and it is safe to lock them in any order.
+ */
+void lock_rename_safe(struct dentry *p1, struct dentry *p2)
+{
+	inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
+	inode_lock_nested(p2->d_inode, I_MUTEX_PARENT2);
+}
+EXPORT_SYMBOL(lock_rename_safe);
+
+void unlock_rename_safe(struct dentry *p1, struct dentry *p2)
+{
+	inode_unlock(p1->d_inode);
+	inode_unlock(p2->d_inode);
+}
+EXPORT_SYMBOL(unlock_rename_safe);
+
 int vfs_create(struct inode *dir, struct dentry *dentry, umode_t mode,
 		bool want_excl)
 {
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index f18c1a6..02b5386 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -367,9 +367,8 @@  int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
 			return PTR_ERR(link);
 	}
 
-	err = -EIO;
-	if (lock_rename(workdir, upperdir) != NULL) {
-		pr_err("overlayfs: failed to lock workdir+upperdir\n");
+	err = ovl_lock_rename_workdir(workdir, upperdir);
+	if (err) {
 		goto out_unlock;
 	}
 	upperdentry = ovl_dentry_upper(dentry);
@@ -386,7 +385,7 @@  int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
 		ovl_set_timestamps(upperdir, &pstat);
 	}
 out_unlock:
-	unlock_rename(workdir, upperdir);
+	ovl_unlock_rename_workdir(workdir, upperdir);
 	do_delayed_call(&done);
 
 	return err;
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index f24b6b9..f0e9b8f 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -202,15 +202,16 @@  static int ovl_create_upper(struct dentry *dentry, struct inode *inode,
 	return err;
 }
 
-static int ovl_lock_rename_workdir(struct dentry *workdir,
-				   struct dentry *upperdir)
+int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir)
 {
 	/* Workdir should not be the same as upperdir */
 	if (workdir == upperdir)
 		goto err;
 
-	/* Workdir should not be subdir of upperdir and vice versa */
-	if (lock_rename(workdir, upperdir) != NULL)
+	/* FIXME: check this once in fill_super */
+	if (upperdir->d_parent == workdir->d_parent->d_parent)
+		lock_rename_safe(workdir, upperdir);
+	else if (lock_rename(workdir, upperdir) != NULL)
 		goto err_unlock;
 
 	return 0;
@@ -222,6 +223,14 @@  static int ovl_lock_rename_workdir(struct dentry *workdir,
 	return -EIO;
 }
 
+void ovl_unlock_rename_workdir(struct dentry *workdir, struct dentry *upperdir)
+{
+	if (upperdir->d_parent == workdir->d_parent->d_parent)
+		unlock_rename_safe(workdir, upperdir);
+	else
+		unlock_rename(workdir, upperdir);
+}
+
 static struct dentry *ovl_clear_empty(struct dentry *dentry,
 				      struct list_head *list)
 {
@@ -283,7 +292,7 @@  static struct dentry *ovl_clear_empty(struct dentry *dentry,
 
 	ovl_cleanup_whiteouts(upper, list);
 	ovl_cleanup(wdir, upper);
-	unlock_rename(workdir, upperdir);
+	ovl_unlock_rename_workdir(workdir, upperdir);
 
 	/* dentry's upper doesn't match now, get rid of it */
 	d_drop(dentry);
@@ -295,7 +304,7 @@  static struct dentry *ovl_clear_empty(struct dentry *dentry,
 out_dput:
 	dput(opaquedir);
 out_unlock:
-	unlock_rename(workdir, upperdir);
+	ovl_unlock_rename_workdir(workdir, upperdir);
 out:
 	return ERR_PTR(err);
 }
@@ -450,7 +459,7 @@  static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode,
 out_dput:
 	dput(newdentry);
 out_unlock:
-	unlock_rename(workdir, upperdir);
+	ovl_unlock_rename_workdir(workdir, upperdir);
 out:
 	if (!hardlink) {
 		posix_acl_release(acl);
@@ -657,7 +666,7 @@  static int ovl_remove_and_whiteout(struct dentry *dentry, bool is_dir)
 out_dput_upper:
 	dput(upper);
 out_unlock:
-	unlock_rename(workdir, upperdir);
+	ovl_unlock_rename_workdir(workdir, upperdir);
 out_dput:
 	dput(opaquedir);
 out:
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index f6e4d35..63ca647 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -211,6 +211,8 @@  int ovl_create_real(struct inode *dir, struct dentry *newdentry,
 		    struct kstat *stat, const char *link,
 		    struct dentry *hardlink, bool debug);
 void ovl_cleanup(struct inode *dir, struct dentry *dentry);
+int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir);
+void ovl_unlock_rename_workdir(struct dentry *workdir, struct dentry *upperdir);
 
 /* copy_up.c */
 int ovl_copy_up(struct dentry *dentry);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index f29abda..31e3f4f 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -88,6 +88,8 @@  extern int follow_up(struct path *);
 
 extern struct dentry *lock_rename(struct dentry *, struct dentry *);
 extern void unlock_rename(struct dentry *, struct dentry *);
+extern void lock_rename_safe(struct dentry *, struct dentry *);
+extern void unlock_rename_safe(struct dentry *, struct dentry *);
 
 extern void nd_jump_link(struct path *path);