diff mbox

[v2,08/20] ovl: lookup index entry for non-dir

Message ID 1496821884-5178-9-git-send-email-amir73il@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Amir Goldstein June 7, 2017, 7:51 a.m. UTC
When inodes index feature is enabled, lookup in indexdir for the index
entry of lower real inode or copy up origin inode. The index entry name
is the hex representation of the lower inode file handle.

If the index dentry in negative, then either no lower aliases have been
copied up yet, or aliases have been copied up in older kernels and are
not indexed.

If the index dentry for a copy up origin inode is positive, but points
to an inode different than the upper inode, then either the upper inode
has been copied up and not indexed or it was indexed, but since then
index dir was cleared. Either way, that index cannot be used to indentify
the overlay inode.

If a negative index dentry is found or a positive dentry that matches the
upper inode, store it in the overlay dentry to be used later. A non-upper
overlay dentry may still have a positive index from copy up of another
lower hardlink. This situation can be tested with the path type macros
OVL_TYPE_INDEX(type) && !OVL_TYPE_UPPER(type).

This is going to be used to prevent breaking hardlinks on copy up.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/namei.c     | 89 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/overlayfs/overlayfs.h |  3 ++
 fs/overlayfs/ovl_entry.h |  1 +
 fs/overlayfs/super.c     |  1 +
 fs/overlayfs/util.c      | 11 ++++++
 5 files changed, 105 insertions(+)

Comments

Miklos Szeredi June 8, 2017, 12:11 p.m. UTC | #1
On Wed, Jun 7, 2017 at 9:51 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> When inodes index feature is enabled, lookup in indexdir for the index
> entry of lower real inode or copy up origin inode. The index entry name
> is the hex representation of the lower inode file handle.
>
> If the index dentry in negative, then either no lower aliases have been
> copied up yet, or aliases have been copied up in older kernels and are
> not indexed.
>
> If the index dentry for a copy up origin inode is positive, but points
> to an inode different than the upper inode, then either the upper inode
> has been copied up and not indexed or it was indexed, but since then
> index dir was cleared. Either way, that index cannot be used to indentify
> the overlay inode.
>
> If a negative index dentry is found or a positive dentry that matches the
> upper inode, store it in the overlay dentry to be used later. A non-upper
> overlay dentry may still have a positive index from copy up of another
> lower hardlink. This situation can be tested with the path type macros
> OVL_TYPE_INDEX(type) && !OVL_TYPE_UPPER(type).
>
> This is going to be used to prevent breaking hardlinks on copy up.

Why is a negative index (or any index, for that matter) stored in the
overlay dentry?  This seems a big waste, since the index dentry will
be allocated for all lower files, yet never used unless copied up.

Index is used:

  - at lookup need to find any copied up alias
  - at copyup need to set up new index

In both cases we can just do a fresh lookup in the index dir with
locks held (which is a good idea anyway).

Something related: should upperdentry of aliases be hardlinked to the
index on lookup and copy-up?  Or should the "hardlink-up" be deferred
until rename?  I think updating as soon as possible is the simpler of
the two.

Thanks,
Miklos
Amir Goldstein June 8, 2017, 2:48 p.m. UTC | #2
On Thu, Jun 8, 2017 at 3:11 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Wed, Jun 7, 2017 at 9:51 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>> When inodes index feature is enabled, lookup in indexdir for the index
>> entry of lower real inode or copy up origin inode. The index entry name
>> is the hex representation of the lower inode file handle.
>>
>> If the index dentry in negative, then either no lower aliases have been
>> copied up yet, or aliases have been copied up in older kernels and are
>> not indexed.
>>
>> If the index dentry for a copy up origin inode is positive, but points
>> to an inode different than the upper inode, then either the upper inode
>> has been copied up and not indexed or it was indexed, but since then
>> index dir was cleared. Either way, that index cannot be used to indentify
>> the overlay inode.
>>
>> If a negative index dentry is found or a positive dentry that matches the
>> upper inode, store it in the overlay dentry to be used later. A non-upper
>> overlay dentry may still have a positive index from copy up of another
>> lower hardlink. This situation can be tested with the path type macros
>> OVL_TYPE_INDEX(type) && !OVL_TYPE_UPPER(type).
>>
>> This is going to be used to prevent breaking hardlinks on copy up.
>
> Why is a negative index (or any index, for that matter) stored in the
> overlay dentry?

One of the reasons it is stored in the overlay dentry is to be able to
test OVL_TYPE_INDEX() on the dentry, for example, in:
"ovl: adjust overlay inode nlink for indexed inodes".

OVL_TYPE_INDEX() is only interested in positive index, so it may seem
like a better idea to save memory and store the index dentry on lookup
only if it is positive, but then OVL_TYPE_INDEX() will be wrong for lower
aliases that became lower indexed aliases due to another alias copy up.
This is important for reasons that are mentioned below.

There may be an opportunity for a massive optimization though.
For all practical purpose, ovl_type_indexed_lower() should always be
false for lower with nlink == 1, so there is no reason so store the negative
index dentry in that case.

> This seems a big waste, since the index dentry will
> be allocated for all lower files, yet never used unless copied up.
>
> Index is used:
>
>   - at lookup need to find any copied up alias
>   - at copyup need to set up new index

So it has several more subtle uses:
- When whiteout a lower aliases, we need to count down nlink to
  know when we can unlink the an orphan index (TODO)
- Same when renaming over a lower indexed alias
- The lower hardlinks copy up on read [1] is another big user

>
> In both cases we can just do a fresh lookup in the index dir with
> locks held (which is a good idea anyway).
>
> Something related: should upperdentry of aliases be hardlinked to the
> index on lookup and copy-up?  Or should the "hardlink-up" be deferred
> until rename?  I think updating as soon as possible is the simpler of
> the two.
>

Agree that updating "as soon as possible" is simpler.
I went even further with "as soon as possible" with lower hardlinks copy
up on read [1]. There is a subtle inconsistency between ovl_inode_real()
and ovl_{dentry,path}_real() of those lower indexed aliases that is hard
to deal with if we are not hardlinking up on the first access.

But there is another reason we cannot defer indexing until rename -
We need to be able to decode a non-dir file handle if its parent was
renamed, so the index has to be there even if the the upper file itself
was not renamed.

I guess that means we will also need to index directories on copy up.
Too bad. I though it was enough to index directories on rename.

Amir.

[1] https://github.com/amir73il/linux/commits/ovl-hardlinks-rocopyup
Miklos Szeredi June 8, 2017, 3:17 p.m. UTC | #3
On Thu, Jun 8, 2017 at 4:48 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Thu, Jun 8, 2017 at 3:11 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Wed, Jun 7, 2017 at 9:51 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>>> When inodes index feature is enabled, lookup in indexdir for the index
>>> entry of lower real inode or copy up origin inode. The index entry name
>>> is the hex representation of the lower inode file handle.
>>>
>>> If the index dentry in negative, then either no lower aliases have been
>>> copied up yet, or aliases have been copied up in older kernels and are
>>> not indexed.
>>>
>>> If the index dentry for a copy up origin inode is positive, but points
>>> to an inode different than the upper inode, then either the upper inode
>>> has been copied up and not indexed or it was indexed, but since then
>>> index dir was cleared. Either way, that index cannot be used to indentify
>>> the overlay inode.
>>>
>>> If a negative index dentry is found or a positive dentry that matches the
>>> upper inode, store it in the overlay dentry to be used later. A non-upper
>>> overlay dentry may still have a positive index from copy up of another
>>> lower hardlink. This situation can be tested with the path type macros
>>> OVL_TYPE_INDEX(type) && !OVL_TYPE_UPPER(type).
>>>
>>> This is going to be used to prevent breaking hardlinks on copy up.
>>
>> Why is a negative index (or any index, for that matter) stored in the
>> overlay dentry?
>
> One of the reasons it is stored in the overlay dentry is to be able to
> test OVL_TYPE_INDEX() on the dentry, for example, in:
> "ovl: adjust overlay inode nlink for indexed inodes".

OVL_TYPE_INDEX() is just a bool.  When we do the *first* copy up of a
file with multiple links we need to update all the aliases anyway:
i.e. do the hardlink-ups, update ->upperdentry, etc.

> OVL_TYPE_INDEX() is only interested in positive index, so it may seem
> like a better idea to save memory and store the index dentry on lookup
> only if it is positive, but then OVL_TYPE_INDEX() will be wrong for lower
> aliases that became lower indexed aliases due to another alias copy up.
> This is important for reasons that are mentioned below.
>
> There may be an opportunity for a massive optimization though.
> For all practical purpose, ovl_type_indexed_lower() should always be
> false for lower with nlink == 1, so there is no reason so store the negative
> index dentry in that case.
>
>> This seems a big waste, since the index dentry will
>> be allocated for all lower files, yet never used unless copied up.
>>
>> Index is used:
>>
>>   - at lookup need to find any copied up alias
>>   - at copyup need to set up new index
>
> So it has several more subtle uses:
> - When whiteout a lower aliases, we need to count down nlink to
>   know when we can unlink the an orphan index (TODO)

If we do the link-up early (at lookup) then whiteout won't need
special casing.  The link-up would be unnecessary in this case, but
delaying it will just cause headaches.

> - Same when renaming over a lower indexed alias

And the same case here.

> - The lower hardlinks copy up on read [1] is another big user

Again, doing it on lookup will be earlier than at read, so no issue.

Thanks,
Miklos
Amir Goldstein June 8, 2017, 4:09 p.m. UTC | #4
On Thu, Jun 8, 2017 at 6:17 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Thu, Jun 8, 2017 at 4:48 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Thu, Jun 8, 2017 at 3:11 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>> On Wed, Jun 7, 2017 at 9:51 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>>>> When inodes index feature is enabled, lookup in indexdir for the index
>>>> entry of lower real inode or copy up origin inode. The index entry name
>>>> is the hex representation of the lower inode file handle.
>>>>
>>>> If the index dentry in negative, then either no lower aliases have been
>>>> copied up yet, or aliases have been copied up in older kernels and are
>>>> not indexed.
>>>>
>>>> If the index dentry for a copy up origin inode is positive, but points
>>>> to an inode different than the upper inode, then either the upper inode
>>>> has been copied up and not indexed or it was indexed, but since then
>>>> index dir was cleared. Either way, that index cannot be used to indentify
>>>> the overlay inode.
>>>>
>>>> If a negative index dentry is found or a positive dentry that matches the
>>>> upper inode, store it in the overlay dentry to be used later. A non-upper
>>>> overlay dentry may still have a positive index from copy up of another
>>>> lower hardlink. This situation can be tested with the path type macros
>>>> OVL_TYPE_INDEX(type) && !OVL_TYPE_UPPER(type).
>>>>
>>>> This is going to be used to prevent breaking hardlinks on copy up.
>>>
>>> Why is a negative index (or any index, for that matter) stored in the
>>> overlay dentry?
>>
>> One of the reasons it is stored in the overlay dentry is to be able to
>> test OVL_TYPE_INDEX() on the dentry, for example, in:
>> "ovl: adjust overlay inode nlink for indexed inodes".
>
> OVL_TYPE_INDEX() is just a bool.  When we do the *first* copy up of a
> file with multiple links we need to update all the aliases anyway:
> i.e. do the hardlink-ups, update ->upperdentry, etc.
>

You mean iterate all existing the overlay inode dentry cache aliases?
Makes sense. together with hardlink-up on lookup, this will make
ovl_type_indexed_lower() impossible. I guess we need not worry about
unhashed/disconnected overlay aliases right now?
although we might need to revisit this assumption with NFS export.

>>
>>> This seems a big waste, since the index dentry will
>>> be allocated for all lower files, yet never used unless copied up.
>>>
>>> Index is used:
>>>
>>>   - at lookup need to find any copied up alias
>>>   - at copyup need to set up new index
>>
>> So it has several more subtle uses:
>> - When whiteout a lower aliases, we need to count down nlink to
>>   know when we can unlink the an orphan index (TODO)
>
> If we do the link-up early (at lookup) then whiteout won't need
> special casing.  The link-up would be unnecessary in this case, but
> delaying it will just cause headaches.
>

While on the subject, maybe you also have an idea about how to
account for whiteouts *before* the first copy up:
If we have N lower hardlinks, delete/whiteout N-1 then copy up
the Nth and then delete the Nth, we need to store the negative nlink
count (N-1) before the index exists to know that we can turn the
orphan index into a whiteout (so NFS decode will guaranty ESTALE).

One though I had was to keep an "anti-index" dir with whiteouts
that are covering the anti-indexed lower.
On mount and on ovl_inode_evict, need to test:
origin.nlink == anti-index.nlink && index.nlink == 1 and unlink
the (positive) index entry. NFS decode can find the anti-index
compare origin.nlink == anti-index.nlink and reach the right
conclusion.

I couldn't find an appropriate solution for rename over though.

>> - Same when renaming over a lower indexed alias
>
> And the same case here.
>
>> - The lower hardlinks copy up on read [1] is another big user
>
> Again, doing it on lookup will be earlier than at read, so no issue.
>

Yes, that would be much better.
I'll rework the patches.

Amir.
Miklos Szeredi June 9, 2017, 8:43 a.m. UTC | #5
On Thu, Jun 8, 2017 at 6:09 PM, Amir Goldstein <amir73il@gmail.com> wrote:

> You mean iterate all existing the overlay inode dentry cache aliases?

Yes.

And locking is needed.  The INUSE flag could provide that, but I'd
prefer something local to the overlay instance (i.e. in ovl_fs).  The
reason is that lower layers could be shared and then separate
instances would be interfering with each other, which is not nice.
First version could just do a single mutex in ovl_fs.  Could refine
that to an array of locks hashed by origin inode, or something.

>
>>>
>>>> This seems a big waste, since the index dentry will
>>>> be allocated for all lower files, yet never used unless copied up.
>>>>
>>>> Index is used:
>>>>
>>>>   - at lookup need to find any copied up alias
>>>>   - at copyup need to set up new index
>>>
>>> So it has several more subtle uses:
>>> - When whiteout a lower aliases, we need to count down nlink to
>>>   know when we can unlink the an orphan index (TODO)
>>
>> If we do the link-up early (at lookup) then whiteout won't need
>> special casing.  The link-up would be unnecessary in this case, but
>> delaying it will just cause headaches.
>>
>
> While on the subject, maybe you also have an idea about how to
> account for whiteouts *before* the first copy up:

Ah, another special case.  We could un-special it by doing the copy-up
to index anyway before commencing with the delete (only for i_nlink >
1, obviously).  Bit of a hack, but would make things simpler.

Thanks,
Miklos
Amir Goldstein June 9, 2017, 9:38 a.m. UTC | #6
On Fri, Jun 9, 2017 at 11:43 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Thu, Jun 8, 2017 at 6:09 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>
>> You mean iterate all existing the overlay inode dentry cache aliases?
>
> Yes.
>
> And locking is needed.  The INUSE flag could provide that, but I'd
> prefer something local to the overlay instance (i.e. in ovl_fs).  The
> reason is that lower layers could be shared and then separate
> instances would be interfering with each other, which is not nice.
> First version could just do a single mutex in ovl_fs.  Could refine
> that to an array of locks hashed by origin inode, or something.
>

Then how about inuse_lock the overlay inode?
It is already unique among all lower/upper aliases.
In fact,  oe->copying and ovl_{start,end}_copy_up could probably
be moved from the overlay dentry to the overlay inode.
It should work the same for the old use cases and provide the needed
(granular) locking for hardlinks copy-aliases-up.
I would just need to implement inode_inuse_lock().
Am I missing something?

>>
>> While on the subject, maybe you also have an idea about how to
>> account for whiteouts *before* the first copy up:
>
> Ah, another special case.  We could un-special it by doing the copy-up
> to index anyway before commencing with the delete (only for i_nlink >
> 1, obviously).  Bit of a hack, but would make things simpler.
>

This certainly works for my use case of clone up, so I don't mind wiring up
this workaround. We would still need to keep persistent count of copy ups
(e.g. in trusted.overlay.nlink of upper), because the condition to whiteout the
index on mount/evict is (origin->i_nlink == trusted.overlay.nlink &&
index->i_nlink == 1), but I was planning to do that anyway.

<soliciting> copy-up-on-delete is also classic use case for metadata-copy-up.
And then we can proceed to metadata-index-copy-up on open for read to
solve the consistent ro/rw fd.
Did you get a chance to look at the page mapping sharing?
Hey, I already set i_data.privae_data to the origin inode.
All you need to do is steel the pages ;-) </soliciting>

Thanks,
Amir.
Miklos Szeredi June 9, 2017, 11:49 a.m. UTC | #7
On Fri, Jun 9, 2017 at 11:38 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Fri, Jun 9, 2017 at 11:43 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Thu, Jun 8, 2017 at 6:09 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>
>>> You mean iterate all existing the overlay inode dentry cache aliases?
>>
>> Yes.
>>
>> And locking is needed.  The INUSE flag could provide that, but I'd
>> prefer something local to the overlay instance (i.e. in ovl_fs).  The
>> reason is that lower layers could be shared and then separate
>> instances would be interfering with each other, which is not nice.
>> First version could just do a single mutex in ovl_fs.  Could refine
>> that to an array of locks hashed by origin inode, or something.
>>
>
> Then how about inuse_lock the overlay inode?
> It is already unique among all lower/upper aliases.
> In fact,  oe->copying and ovl_{start,end}_copy_up could probably
> be moved from the overlay dentry to the overlay inode.
> It should work the same for the old use cases and provide the needed
> (granular) locking for hardlinks copy-aliases-up.

I like that.

> I would just need to implement inode_inuse_lock().

Just put the mutex in the new struct ovl_inode?

>
>>>
>>> While on the subject, maybe you also have an idea about how to
>>> account for whiteouts *before* the first copy up:
>>
>> Ah, another special case.  We could un-special it by doing the copy-up
>> to index anyway before commencing with the delete (only for i_nlink >
>> 1, obviously).  Bit of a hack, but would make things simpler.
>>
>
> This certainly works for my use case of clone up, so I don't mind wiring up
> this workaround. We would still need to keep persistent count of copy ups
> (e.g. in trusted.overlay.nlink of upper), because the condition to whiteout the
> index on mount/evict is (origin->i_nlink == trusted.overlay.nlink &&
> index->i_nlink == 1), but I was planning to do that anyway.

Yes.

The challenge is in accounting the number of link-ups atomically with
the link-up itself.  No ideas off-hand.

> <soliciting> copy-up-on-delete is also classic use case for metadata-copy-up.
> And then we can proceed to metadata-index-copy-up on open for read to
> solve the consistent ro/rw fd.
> Did you get a chance to look at the page mapping sharing?

Not yet.

Thanks,
Miklos
Miklos Szeredi June 9, 2017, 1:14 p.m. UTC | #8
On Fri, Jun 9, 2017 at 1:49 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Fri, Jun 9, 2017 at 11:38 AM, Amir Goldstein <amir73il@gmail.com> wrote:

> The challenge is in accounting the number of link-ups atomically with
> the link-up itself.  No ideas off-hand.

Following would work I think:

- copy up to indexdir (it now has st_nlink == 1)
- set overlay.nlinkup to "1-2"
  + the first number indicates the target number of linkups
  + the second indicates the target st_nlink on the file
- fsync the file (hopefully this writes the xattrs as well as the data)
- link the file from indexdir to upper
- set overlay.nlinkup to "1"
   + this indicates that the linkup was finished and number of linkups is 1.

There are 3 cases when we find a file with overlay.nlinkup:

- just one number: nothing to do
- two numbers and second number == st_nlink: replace with just the linkup value
- two numbers and second number != st_nlink: replace with decremented
linkup value

So the atomicity is guaranteed by st_nlink becoming in sync with the
target value.

Would need to ensure that ovl_link() is not performed in parallel with
the linkup procedure.

Thanks,
Miklos

 - power fai


overlay.nlinkup
Amir Goldstein June 9, 2017, 1:24 p.m. UTC | #9
On Fri, Jun 9, 2017 at 4:14 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Fri, Jun 9, 2017 at 1:49 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Fri, Jun 9, 2017 at 11:38 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>
>> The challenge is in accounting the number of link-ups atomically with
>> the link-up itself.  No ideas off-hand.
>
> Following would work I think:
>
> - copy up to indexdir (it now has st_nlink == 1)
> - set overlay.nlinkup to "1-2"
>   + the first number indicates the target number of linkups
>   + the second indicates the target st_nlink on the file
> - fsync the file (hopefully this writes the xattrs as well as the data)
> - link the file from indexdir to upper
> - set overlay.nlinkup to "1"
>    + this indicates that the linkup was finished and number of linkups is 1.
>
> There are 3 cases when we find a file with overlay.nlinkup:
>
> - just one number: nothing to do
> - two numbers and second number == st_nlink: replace with just the linkup value
> - two numbers and second number != st_nlink: replace with decremented
> linkup value
>
> So the atomicity is guaranteed by st_nlink becoming in sync with the
> target value.
>
> Would need to ensure that ovl_link() is not performed in parallel with
> the linkup procedure.
>

Yep, I started to write you the same thing :)
I though of using a separate xattr for the transnational values,
and delete it after link up, but I guess single xattr is cleaner ??

I guess the new ovl_inode mutex could be used for blocking ovl_link().

Just to be clear:
- we are going to allocate ovl_inode from our own slab cache
- move from using the i_private inode fields to ovl_inode fields:
  {realinode, originode, nlinkup, lock}
- use ovl_inode mutex instead of wait queue and for hardlink up

Correct?

Amir.
Miklos Szeredi June 9, 2017, 1:29 p.m. UTC | #10
On Fri, Jun 9, 2017 at 3:24 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Fri, Jun 9, 2017 at 4:14 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Fri, Jun 9, 2017 at 1:49 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>> On Fri, Jun 9, 2017 at 11:38 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>>
>>> The challenge is in accounting the number of link-ups atomically with
>>> the link-up itself.  No ideas off-hand.
>>
>> Following would work I think:
>>
>> - copy up to indexdir (it now has st_nlink == 1)
>> - set overlay.nlinkup to "1-2"
>>   + the first number indicates the target number of linkups
>>   + the second indicates the target st_nlink on the file
>> - fsync the file (hopefully this writes the xattrs as well as the data)
>> - link the file from indexdir to upper
>> - set overlay.nlinkup to "1"
>>    + this indicates that the linkup was finished and number of linkups is 1.
>>
>> There are 3 cases when we find a file with overlay.nlinkup:
>>
>> - just one number: nothing to do
>> - two numbers and second number == st_nlink: replace with just the linkup value
>> - two numbers and second number != st_nlink: replace with decremented
>> linkup value
>>
>> So the atomicity is guaranteed by st_nlink becoming in sync with the
>> target value.
>>
>> Would need to ensure that ovl_link() is not performed in parallel with
>> the linkup procedure.
>>
>
> Yep, I started to write you the same thing :)
> I though of using a separate xattr for the transnational values,
> and delete it after link up, but I guess single xattr is cleaner ??
>
> I guess the new ovl_inode mutex could be used for blocking ovl_link().
>
> Just to be clear:
> - we are going to allocate ovl_inode from our own slab cache
> - move from using the i_private inode fields to ovl_inode fields:
>   {realinode, originode, nlinkup, lock}
> - use ovl_inode mutex instead of wait queue and for hardlink up
>
> Correct?

Sounds good.

Thanks,
Miklos
Amir Goldstein June 9, 2017, 10:56 p.m. UTC | #11
On Fri, Jun 9, 2017 at 4:14 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Fri, Jun 9, 2017 at 1:49 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Fri, Jun 9, 2017 at 11:38 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>
>> The challenge is in accounting the number of link-ups atomically with
>> the link-up itself.  No ideas off-hand.
>
> Following would work I think:
>
> - copy up to indexdir (it now has st_nlink == 1)
> - set overlay.nlinkup to "1-2"
>   + the first number indicates the target number of linkups
>   + the second indicates the target st_nlink on the file
> - fsync the file (hopefully this writes the xattrs as well as the data)
> - link the file from indexdir to upper
> - set overlay.nlinkup to "1"
>    + this indicates that the linkup was finished and number of linkups is 1.
>

Or like this:

struct ovl_nlink_adjust {
  int nlink_diff;
  bool from_lower;
}

Init overlay inode:
overlay_nlink := (from_lower ? lower_nlink : upper_nlink) + nlink_diff

Copy/link up:
- take ovl_inode lock
- set overlay.nlink { (overlay_nlink - lower_link), true }
- link index (won't change diff from lower inode nlink on fail/success)

Unlink/link/rename:
- take ovl_inode lock
- set overlay.nlink { (overlay_nlink - upper_nlink), false }
- unlink/link/rename (won't change diff from upper nlink on fail/success)

I don't think fsync matters to this game.
I'd be surprised if the nlink changing operations (link/unlink/rename) can be
reordered with the setxattr operations on a power fail safe fs.

Amir.
diff mbox

Patch

diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
index 61f4988527f4..d204488bf23c 100644
--- a/fs/overlayfs/namei.c
+++ b/fs/overlayfs/namei.c
@@ -367,6 +367,84 @@  int ovl_verify_origin(struct dentry *dentry, struct vfsmount *mnt,
 }
 
 /*
+ * Lookup in indexdir for the index entry of a lower real inode or a copy up
+ * origin inode. The index entry name is the hex representation of the lower
+ * inode file handle.
+ *
+ * If the index dentry in negative, then either no lower aliases have been
+ * copied up yet, or aliases have been copied up in older kernels and are
+ * not indexed.
+ *
+ * If the index dentry for a copy up origin inode is positive, but points
+ * to an inode different than the upper inode, then either the upper inode
+ * has been copied up and not indexed or it was indexed, but since then
+ * index dir was cleared. Either way, that index cannot be used to indentify
+ * the overlay inode.
+ */
+static int ovl_lookup_index(struct dentry *dentry, struct dentry *upper,
+			    struct dentry *lower, struct dentry **indexp)
+{
+	struct ovl_fs *ofs = dentry->d_sb->s_fs_info;
+	struct ovl_fh *fh;
+	struct dentry *index = NULL;
+	struct inode *inode;
+	char *s, *name = NULL;
+	long namelen = 0;
+	int err;
+
+	if (WARN_ON(!ofs->indexdir))
+		return -ENOENT;
+
+	fh = ovl_encode_fh(lower, false);
+	if (IS_ERR(fh)) {
+		err = PTR_ERR(fh);
+		fh = NULL;
+		goto fail;
+	}
+
+	err = -ENAMETOOLONG;
+	namelen = fh->len * 2;
+	if (namelen > ofs->namelen)
+		goto fail;
+
+	err = -ENOMEM;
+	name = kzalloc(namelen + 1, GFP_TEMPORARY);
+	if (!name)
+		goto fail;
+
+	s  = bin2hex(name, fh, fh->len);
+	namelen = s - name;
+
+	index = lookup_one_len_unlocked(name, ofs->indexdir, namelen);
+	if (IS_ERR(index)) {
+		err = PTR_ERR(index);
+		goto fail;
+	}
+
+	if (upper && d_inode(index) != d_inode(upper)) {
+		inode = d_inode(index);
+		pr_debug("ovl_lookup_index: upper with origin not indexed (%pd2, ino=%lu)\n",
+			 upper, inode ? inode->i_ino : 0);
+		dput(index);
+		index = NULL;
+	}
+
+	*indexp = index;
+	err = 0;
+
+out:
+	kfree(fh);
+	kfree(name);
+	return err;
+
+fail:
+	inode = d_inode(lower);
+	pr_warn_ratelimited("overlayfs: failed inode index lookup (ino=%lu, key=%*s, err=%i); mount with '-o index=off' to disable inodes index.\n",
+			    inode ? inode->i_ino : 0, (int)namelen, name, err);
+	goto out;
+}
+
+/*
  * Returns next layer in stack starting from top.
  * Returns -1 if this is the last layer.
  */
@@ -400,6 +478,7 @@  struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 	struct ovl_entry *roe = dentry->d_sb->s_root->d_fsdata;
 	struct path *stack = NULL;
 	struct dentry *upperdir, *upperdentry = NULL;
+	struct dentry *indexdentry = NULL;
 	unsigned int ctr = 0;
 	struct inode *inode = NULL;
 	bool upperopaque = false;
@@ -500,6 +579,14 @@  struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 		}
 	}
 
+	/* Lookup index by lower inode and verify it matches upper inode */
+	if (ctr && !d.is_dir && ovl_indexdir(dentry->d_sb)) {
+		err = ovl_lookup_index(dentry, upperdentry, stack[0].dentry,
+				       &indexdentry);
+		if (err)
+			goto out_put;
+	}
+
 	oe = ovl_alloc_entry(ctr);
 	err = -ENOMEM;
 	if (!oe)
@@ -531,6 +618,7 @@  struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 	oe->impure = upperimpure;
 	oe->redirect = upperredirect;
 	oe->__upperdentry = upperdentry;
+	oe->indexdentry = indexdentry;
 	memcpy(oe->lowerstack, stack, sizeof(struct path) * ctr);
 	kfree(stack);
 	kfree(d.redirect);
@@ -542,6 +630,7 @@  struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
 out_free_oe:
 	kfree(oe);
 out_put:
+	dput(indexdentry);
 	for (i = 0; i < ctr; i++)
 		dput(stack[i].dentry);
 	kfree(stack);
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 6fd7c9e748c1..47d62c585c96 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -14,11 +14,13 @@  enum ovl_path_type {
 	__OVL_PATH_UPPER	= (1 << 0),
 	__OVL_PATH_MERGE	= (1 << 1),
 	__OVL_PATH_ORIGIN	= (1 << 2),
+	__OVL_PATH_INDEX	= (1 << 3),
 };
 
 #define OVL_TYPE_UPPER(type)	((type) & __OVL_PATH_UPPER)
 #define OVL_TYPE_MERGE(type)	((type) & __OVL_PATH_MERGE)
 #define OVL_TYPE_ORIGIN(type)	((type) & __OVL_PATH_ORIGIN)
+#define OVL_TYPE_INDEX(type)	((type) & __OVL_PATH_INDEX)
 
 #define OVL_XATTR_PREFIX XATTR_TRUSTED_PREFIX "overlay."
 #define OVL_XATTR_OPAQUE OVL_XATTR_PREFIX "opaque"
@@ -204,6 +206,7 @@  void ovl_path_lower(struct dentry *dentry, struct path *path);
 enum ovl_path_type ovl_path_real(struct dentry *dentry, struct path *path);
 struct dentry *ovl_dentry_upper(struct dentry *dentry);
 struct dentry *ovl_dentry_lower(struct dentry *dentry);
+struct dentry *ovl_dentry_index(struct dentry *dentry);
 struct dentry *ovl_dentry_real(struct dentry *dentry);
 struct ovl_dir_cache *ovl_dir_cache(struct dentry *dentry);
 void ovl_set_dir_cache(struct dentry *dentry, struct ovl_dir_cache *cache);
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index ddd4b0a874a9..dc08ce5c4ac0 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -43,6 +43,7 @@  struct ovl_fs {
 /* private information held for every overlayfs dentry */
 struct ovl_entry {
 	struct dentry *__upperdentry;
+	struct dentry *indexdentry;
 	struct ovl_dir_cache *cache;
 	union {
 		struct {
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index b411b6d5f7b1..b53c08f70b1a 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -47,6 +47,7 @@  static void ovl_dentry_release(struct dentry *dentry)
 		unsigned int i;
 
 		dput(oe->__upperdentry);
+		dput(oe->indexdentry);
 		kfree(oe->redirect);
 		for (i = 0; i < oe->numlower; i++)
 			dput(oe->lowerstack[i].dentry);
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 8e2f308056f1..6048f204be07 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -108,6 +108,10 @@  enum ovl_path_type ovl_path_type(struct dentry *dentry)
 		if (oe->numlower > 1)
 			type |= __OVL_PATH_MERGE;
 	}
+	/* Non-upper can have a positive index dentry from another hardlink */
+	if (oe->indexdentry && d_inode(oe->indexdentry))
+		type |= __OVL_PATH_INDEX;
+
 	return type;
 }
 
@@ -158,6 +162,13 @@  struct dentry *ovl_dentry_lower(struct dentry *dentry)
 	return __ovl_dentry_lower(oe);
 }
 
+struct dentry *ovl_dentry_index(struct dentry *dentry)
+{
+	struct ovl_entry *oe = dentry->d_fsdata;
+
+	return oe->indexdentry;
+}
+
 struct dentry *ovl_dentry_real(struct dentry *dentry)
 {
 	struct ovl_entry *oe = dentry->d_fsdata;