[v2,01/17] ovl: document NFS export

Message ID 1515086449-26563-2-git-send-email-amir73il@gmail.com (mailing list archive)
State New, archived

Commit Message

Amir Goldstein Jan. 4, 2018, 5:20 p.m. UTC
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 Documentation/filesystems/overlayfs.txt | 59 +++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

Comments

Miklos Szeredi Jan. 11, 2018, 4:06 p.m. UTC | #1
On Thu, Jan 4, 2018 at 6:20 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  Documentation/filesystems/overlayfs.txt | 59 +++++++++++++++++++++++++++++++++
>  1 file changed, 59 insertions(+)
>
> diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt
> index 00e0595f3d7e..9e21c14c914c 100644
> --- a/Documentation/filesystems/overlayfs.txt
> +++ b/Documentation/filesystems/overlayfs.txt
> @@ -315,6 +315,65 @@ origin file handle that was stored at copy_up time.  If a found lower
>  directory does not match the stored origin, that directory will not be
>  merged with the upper directory.
>
> +
> +NFS export
> +----------
> +
> +When the underlying filesystems supports NFS export and the "verify"
> +feature is enabled, an overlay filesystem may be exported to NFS.
> +
> +With the "verify" feature, on copy_up of any lower object, an index
> +entry is created under the index directory.  The index entry name is the
> +hexadecimal representation of the copy up origin file handle.  For a
> +non-directory object, the index entry is a hard link to the upper inode.
> +For a directory object, the index entry has an extended attribute
> +"trusted.overlay.origin" with an encoded file handle of the upper
> +directory inode.
> +
> +When encoding a file handle from an overlay filesystem object, the
> +following rules apply:
> +
> +1. For a non-upper object, encode a lower file handle from lower inode
> +2. For an indexed object, encode a lower file handle from copy_up origin
> +3. For a pure-upper object and for an existing non-indexed upper object,
> +   encode an upper file handle from upper inode
> +
> +Encoding of a non-upper directory object is not supported when overlay
> +filesystem has multiple lower layers.  In this case, the directory will
> +be copied up first, and then encoded as an upper file handle.

Why?

What's the difference from encoding the uppermost lower layer directory?

> +
> +The encoded overlay file handle includes:
> + - Header including path type information (e.g. lower/upper)
> + - UUID of the underlying filesystem
> + - Underlying filesystem encoding of underlying inode
> +
> +This encoding is identical to the encoding of copy_up origin stored in
> +"trusted.overlay.origin".
> +
> +When decoding an overlay file handle, the following steps are followed:
> +
> +1. Find underlying layer by UUID and path type information.
> +2. Decode the underlying filesystem file handle to underlying dentry.
> +3. For a lower file handle, lookup the handle in index directory by name.
> +4. If a whiteout is found in index, return ESTALE. This represents an
> +   overlay object that was deleted after its file handle was encoded.
> +5. For a non-directory, instantiate a disconnected overlay dentry from the
> +   decoded underlying dentry, the path type and index inode, if found.
> +6. For a directory, use the connected underlying decoded dentry, path type
> +   and index, to lookup a connected overlay dentry.
> +
> +The "verify" feature ensures, that a decoded overlay directory object will
> +be equivalent to the object that was used to encode the file handle.
> +

What's equivalent?  What are the guarantees needed by NFS server?  It
doesn't verify object version, so modification is OK.

Does swapping out lower dirs count as modification or does it count as
new object?

Thanks,
Miklos
Amir Goldstein Jan. 11, 2018, 4:26 p.m. UTC | #2
On Thu, Jan 11, 2018 at 6:06 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Thu, Jan 4, 2018 at 6:20 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>> ---
>>  Documentation/filesystems/overlayfs.txt | 59 +++++++++++++++++++++++++++++++++
>>  1 file changed, 59 insertions(+)
>>
>> diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt
>> index 00e0595f3d7e..9e21c14c914c 100644
>> --- a/Documentation/filesystems/overlayfs.txt
>> +++ b/Documentation/filesystems/overlayfs.txt
>> @@ -315,6 +315,65 @@ origin file handle that was stored at copy_up time.  If a found lower
>>  directory does not match the stored origin, that directory will not be
>>  merged with the upper directory.
>>
>> +
>> +NFS export
>> +----------
>> +
>> +When the underlying filesystems supports NFS export and the "verify"
>> +feature is enabled, an overlay filesystem may be exported to NFS.
>> +
>> +With the "verify" feature, on copy_up of any lower object, an index
>> +entry is created under the index directory.  The index entry name is the
>> +hexadecimal representation of the copy up origin file handle.  For a
>> +non-directory object, the index entry is a hard link to the upper inode.
>> +For a directory object, the index entry has an extended attribute
>> +"trusted.overlay.origin" with an encoded file handle of the upper
>> +directory inode.
>> +
>> +When encoding a file handle from an overlay filesystem object, the
>> +following rules apply:
>> +
>> +1. For a non-upper object, encode a lower file handle from lower inode
>> +2. For an indexed object, encode a lower file handle from copy_up origin
>> +3. For a pure-upper object and for an existing non-indexed upper object,
>> +   encode an upper file handle from upper inode
>> +
>> +Encoding of a non-upper directory object is not supported when overlay
>> +filesystem has multiple lower layers.  In this case, the directory will
>> +be copied up first, and then encoded as an upper file handle.
>
> Why?
>
> What's the difference from encoding the uppermost lower layer directory?

Sigh... hard to document... here goes an attempt.
Let me know if it works:

When decoding an upper dir, the decoded upper path is the same path as
the overlay path, so we look up the same path in the overlay.

When decoding a lower dir from layer 1, every ancestor is either still lower
(and therefore not renamed) or has been copied up and indexed by lower inode,
so we can use the index to find the path of every ancestor in the overlay (or
to find out that it has been removed).

When decoding a lower dir from layer 2, there may be an ancestor in layer 2
covered by whiteout in layer 1 and redirected from another directory in layer 1.
In that case, we have no information in index to reconstruct the overlay path
from the connected layer 2 directory, hence, we cannot decode a connected
overlay directory from dir file handle encoded from layer 2.

Copy up on encode mitigates this problem, because it hops over the
non-indexed redirects.

BTW, the same thing could happen with a dir file handle from layer 1 when
exporting an overlay that has existing non-indexed merge dirs.
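
As an illustration only, the three encode rules from the patch, plus the
copy-up-on-encode behaviour for directories described above, can be modelled
in a few lines of C.  This is a sketch under stated assumptions, not the
overlayfs implementation: struct obj, model_copy_up() and choose_fh_kind()
are hypothetical names, and the encode_upper field stands in for whatever
per-dentry marking the real code ends up using.

```c
#include <stdbool.h>

/* Hypothetical model of an overlay object; NOT the kernel's data structures. */
enum fh_kind { FH_LOWER, FH_UPPER };

struct obj {
	bool is_dir;
	bool has_upper;       /* an upper inode exists */
	bool indexed;         /* copy_up created an index entry */
	int  nr_lower_layers; /* number of lower layers in the overlay */
	bool encode_upper;    /* assumption: set once copied up for encode */
};

/* Model of copy_up with "verify": the object gains an upper and an index. */
static void model_copy_up(struct obj *o)
{
	o->has_upper = true;
	o->indexed = true;
}

/* Decide which kind of file handle to encode for an object. */
static enum fh_kind choose_fh_kind(struct obj *o)
{
	/* Once copied up for encode, always encode as upper for consistency. */
	if (o->encode_upper)
		return FH_UPPER;

	if (!o->has_upper) {
		/*
		 * v2 patch text: a non-upper directory in an overlay with
		 * multiple lower layers is copied up first and then encoded
		 * as an upper file handle.
		 */
		if (o->is_dir && o->nr_lower_layers > 1) {
			model_copy_up(o);
			o->encode_upper = true;
			return FH_UPPER;
		}
		return FH_LOWER;	/* rule 1: non-upper object */
	}

	if (o->indexed)
		return FH_LOWER;	/* rule 2: indexed, use copy_up origin */

	return FH_UPPER;		/* rule 3: pure or non-indexed upper */
}
```

In this model, a lower directory in a two-layer overlay is copied up on the
first call and yields FH_UPPER on every later call, while a lower regular
file in the same overlay still yields FH_LOWER.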

>
>> +
>> +The encoded overlay file handle includes:
>> + - Header including path type information (e.g. lower/upper)
>> + - UUID of the underlying filesystem
>> + - Underlying filesystem encoding of underlying inode
>> +
>> +This encoding is identical to the encoding of copy_up origin stored in
>> +"trusted.overlay.origin".
>> +
>> +When decoding an overlay file handle, the following steps are followed:
>> +
>> +1. Find underlying layer by UUID and path type information.
>> +2. Decode the underlying filesystem file handle to underlying dentry.
>> +3. For a lower file handle, lookup the handle in index directory by name.
>> +4. If a whiteout is found in index, return ESTALE. This represents an
>> +   overlay object that was deleted after its file handle was encoded.
>> +5. For a non-directory, instantiate a disconnected overlay dentry from the
>> +   decoded underlying dentry, the path type and index inode, if found.
>> +6. For a directory, use the connected underlying decoded dentry, path type
>> +   and index, to lookup a connected overlay dentry.
>> +
>> +The "verify" feature ensures, that a decoded overlay directory object will
>> +be equivalent to the object that was used to encode the file handle.
>> +
>
> What's equivalent?  What are the guarantees needed by NFS server?  It
> doesn't verify object version, so modification is OK.
>
> Does swapping out lower dirs count as modification or does it count as
> new object?
>

To be honest, I don't know what I was trying to say.
In the updated version of the patches and documentation that I just pushed to
https://github.com/amir73il/linux/commits/ovl-nfs-export
this obscure sentence is gone.

Is there anything else that needs clarification?

Thanks,
Amir.
Miklos Szeredi Jan. 12, 2018, 3:43 p.m. UTC | #3
On Thu, Jan 11, 2018 at 5:26 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Thu, Jan 11, 2018 at 6:06 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Thu, Jan 4, 2018 at 6:20 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>>> ---
>>>  Documentation/filesystems/overlayfs.txt | 59 +++++++++++++++++++++++++++++++++
>>>  1 file changed, 59 insertions(+)
>>>
>>> diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt
>>> index 00e0595f3d7e..9e21c14c914c 100644
>>> --- a/Documentation/filesystems/overlayfs.txt
>>> +++ b/Documentation/filesystems/overlayfs.txt
>>> @@ -315,6 +315,65 @@ origin file handle that was stored at copy_up time.  If a found lower
>>>  directory does not match the stored origin, that directory will not be
>>>  merged with the upper directory.
>>>
>>> +
>>> +NFS export
>>> +----------
>>> +
>>> +When the underlying filesystems supports NFS export and the "verify"
>>> +feature is enabled, an overlay filesystem may be exported to NFS.
>>> +
>>> +With the "verify" feature, on copy_up of any lower object, an index
>>> +entry is created under the index directory.  The index entry name is the
>>> +hexadecimal representation of the copy up origin file handle.  For a
>>> +non-directory object, the index entry is a hard link to the upper inode.
>>> +For a directory object, the index entry has an extended attribute
>>> +"trusted.overlay.origin" with an encoded file handle of the upper
>>> +directory inode.
>>> +
>>> +When encoding a file handle from an overlay filesystem object, the
>>> +following rules apply:
>>> +
>>> +1. For a non-upper object, encode a lower file handle from lower inode
>>> +2. For an indexed object, encode a lower file handle from copy_up origin
>>> +3. For a pure-upper object and for an existing non-indexed upper object,
>>> +   encode an upper file handle from upper inode
>>> +
>>> +Encoding of a non-upper directory object is not supported when overlay
>>> +filesystem has multiple lower layers.  In this case, the directory will
>>> +be copied up first, and then encoded as an upper file handle.
>>
>> Why?
>>
>> What's the difference from encoding the uppermost lower layer directory?
>
> Sigh... hard to document... here goes an attempt.
> Let me know if it works:
>
> When decoding an upper dir, the decoded upper path is the same path as
> the overlay path, so we lookup same path in overlay.
>
> When decoding a lower dir from layer 1, every ancestor is either still lower
> (and therefore not renamed) or been copied up and indexed by lower inode,
> so we can use index to know the path of every ancestor in overlay (or if it
> has been removed).
>
> When decoding a lower dir from layer 2, there may be an ancestor in layer 2
> covered by whiteout in layer 1 and redirected from another directory in layer 1.
> In that case, we have no information in index to reconstruct the overlay path
> from the connected layer 2 directory, hence, we cannot decode a connected
> overlay directory from dir file handle encoded from layer 2.

Now I understand: we are missing the back pointer from layer2 to
layer1 that the index provides us when going from lower to upper.

However, this is only needed if we end up below a redirecting layer.
So we could limit copy-up to these cases.  It doesn't seem hard to
keep track of highest layer that had a redirect in each overlay
dentry, and when ending up on a layer below that, mark the overlay
dentry COPY_UP_FOR_ENCODE.  This information is constant, since lower
layers are immutable, so no worries there.  Can postpone this to a
later version, but the takeaway is that we need to mark the fh to
indicate if it's a merge upper or not.

Thanks,
Miklos
Miklos Szeredi Jan. 12, 2018, 3:49 p.m. UTC | #4
On Fri, Jan 12, 2018 at 4:43 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Thu, Jan 11, 2018 at 5:26 PM, Amir Goldstein <amir73il@gmail.com> wrote:

>> When decoding an upper dir, the decoded upper path is the same path as
>> the overlay path, so we lookup same path in overlay.
>>
>> When decoding a lower dir from layer 1, every ancestor is either still lower
>> (and therefore not renamed) or been copied up and indexed by lower inode,
>> so we can use index to know the path of every ancestor in overlay (or if it
>> has been removed).
>>
>> When decoding a lower dir from layer 2, there may be an ancestor in layer 2
>> covered by whiteout in layer 1 and redirected from another directory in layer 1.
>> In that case, we have no information in index to reconstruct the overlay path
>> from the connected layer 2 directory, hence, we cannot decode a connected
>> overlay directory from dir file handle encoded from layer 2.
>
> Now I understand: we are missing the back pointer from layer2 to
> layer1 that the index provides us when going from lower to upper.
>
> However, this is only needed if we end up below a redirecting layer.
> So we could limit copy-up to these cases.  It doesn't seem hard to
> keep track of highest layer that had a redirect in each overlay
> dentry, and when ending up on a layer below that, mark the overlay
> dentry COPY_UP_FOR_ENCODE.  This information is constant, since lower
> layers are immutable, so no worries there.  Can postpone this to a
> later version, but the takeaway is that we need to mark the fh to
> indicate if it's a merge upper or not.

And BTW, we need to copy up only the directory that has the redirect,
since that's where we are missing the mapping in the lower layers.
Below that in the tree, we are fine, until we come across another
redirect, and so on...

Thanks,
Miklos
Amir Goldstein Jan. 12, 2018, 6:50 p.m. UTC | #5
On Fri, Jan 12, 2018 at 5:49 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Fri, Jan 12, 2018 at 4:43 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Thu, Jan 11, 2018 at 5:26 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>
>>> When decoding an upper dir, the decoded upper path is the same path as
>>> the overlay path, so we lookup same path in overlay.
>>>
>>> When decoding a lower dir from layer 1, every ancestor is either still lower
>>> (and therefore not renamed) or been copied up and indexed by lower inode,
>>> so we can use index to know the path of every ancestor in overlay (or if it
>>> has been removed).
>>>
>>> When decoding a lower dir from layer 2, there may be an ancestor in layer 2
>>> covered by whiteout in layer 1 and redirected from another directory in layer 1.
>>> In that case, we have no information in index to reconstruct the overlay path
>>> from the connected layer 2 directory, hence, we cannot decode a connected
>>> overlay directory from dir file handle encoded from layer 2.
>>
>> Now I understand: we are missing the back pointer from layer2 to
>> layer1 that the index provides us when going from lower to upper.
>>
>> However, this is only needed if we end up below a redirecting layer.
>> So we could limit copy-up to these cases.  It doesn't seem hard to
>> keep track of highest layer that had a redirect in each overlay
>> dentry, and when ending up on a layer below that, mark the overlay
>> dentry COPY_UP_FOR_ENCODE.  This information is constant, since lower
>> layers are immutable, so no worries there.

Right.

>> Can postpone this to a
>> later version, but the takeaway is that we need to mark the fh to
>> indicate if it's a merge upper or not.
>

This I did not get.
The fh is marked upper or not.
If it is upper, we get the real upper path and lookup that path in overlay.
Whether upper is merge or not, overlay lookup will find out.

What am I missing?


> And BTW, we need to copy up only the directory that has the redirect,
> since that's where we are missing the mapping in the lower layers.
> Below that in the tree, we are fine, until we come across another
> redirect, and so on...
>

So actually, I can use the OVL_RENAMED flag from patch 8/23
and implement ovl_copy_up_renamed_parent() on encode.
This will also cover the case of a dir in layer1 that has a
non-indexed redirected upper parent.

Thanks,
Amir.
Amir Goldstein Jan. 13, 2018, 8:54 a.m. UTC | #6
On Fri, Jan 12, 2018 at 5:49 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Fri, Jan 12, 2018 at 4:43 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Thu, Jan 11, 2018 at 5:26 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>
>>> When decoding an upper dir, the decoded upper path is the same path as
>>> the overlay path, so we lookup same path in overlay.
>>>
>>> When decoding a lower dir from layer 1, every ancestor is either still lower
>>> (and therefore not renamed) or been copied up and indexed by lower inode,
>>> so we can use index to know the path of every ancestor in overlay (or if it
>>> has been removed).
>>>
>>> When decoding a lower dir from layer 2, there may be an ancestor in layer 2
>>> covered by whiteout in layer 1 and redirected from another directory in layer 1.
>>> In that case, we have no information in index to reconstruct the overlay path
>>> from the connected layer 2 directory, hence, we cannot decode a connected
>>> overlay directory from dir file handle encoded from layer 2.
>>
>> Now I understand: we are missing the back pointer from layer2 to
>> layer1 that the index provides us when going from lower to upper.
>>
>> However, this is only needed if we end up below a redirecting layer.
>> So we could limit copy-up to these cases.  It doesn't seem hard to
>> keep track of highest layer that had a redirect in each overlay
>> dentry, and when ending up on a layer below that, mark the overlay
>> dentry COPY_UP_FOR_ENCODE.  This information is constant, since lower
>> layers are immutable, so no worries there.  Can postpone this to a
>> later version, but the takeaway is that we need to mark the fh to
>> indicate if it's a merge upper or not.
>

I did not understand what you mean by marking the fh as merge upper or not.
In any case, the idea is to mark dentries like these as ENCODE_UPPER, so that
not only will they be copied up on encode, but they will also *always* be
encoded as upper for consistency.

> And BTW, we need to copy up only the directory that has the redirect,
> since that's where we are missing the mapping in the lower layers.
> Below that in the tree, we are fine, until we come across another
> redirect, and so on...
>

I think that is a somewhat simplified description of the situation.
Things can be more complicated, for example:

layer1: /a/b (/a has redirect to 'A')
layer2: /A/b/c/d
layer3: /A/b/c/d/e/f

When decoding the lower path /A/b/c/d, copy up of /a will index by
layer1 dir /a and doesn't help with backward redirect from layer2 dir /A.

Copy up of layer1 /a/b doesn't help either.

We must find the ancestor of /A/b/c/d which is an 'uppermost lower',
which is /A/b/c, and copy up/index that ancestor.

So I *think* we need to store in lookup per dentry:
- reconnect_layer_idx:
  The highest layer with a non-indexed redirect (can be upper layer
  in case of a non-indexed upper merge dir) among all ancestors.
  If we encode a file handle from dir in reconnect_layer or above,
  we can decode it and use decoded path to reconnect overlay dentry.
- OVL_ENCODE_UPPER:
  This is determined by combination of reconnect_layer_idx, the
  uppermost lower layer of self and uppermost lower layer of parent.
  I *think* the condition is:
    lowerpath[0]->layer->idx > parent->lowerpath[0]->layer->idx &&
    lowerpath[0]->layer->idx > reconnect_layer_idx

In the example above, dentries /a/b/c and /a/b/c/d/e are marked
OVL_ENCODE_UPPER. /a/b/c should be copied up when encoding
/a/b/c/d and /a/b/c/d/e should be copied up when encoding /a/b/c/d/e/f.
In reality, I assume nfsd always encodes /a/b/c on lookup of
/a/b/c/d before encoding /a/b/c/d, but to be on the safe side, we
need to take care of copy up of OVL_ENCODE_UPPER ancestor.
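
To make the proposed condition concrete, here is a minimal, self-contained
C model of it.  It is only a sketch: struct model_dentry and
model_encode_upper() are hypothetical names, top_lower_idx stands for
lowerpath[0]->layer->idx, and treating the overlay root as never marked is
an assumption added for the model.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-dentry state; NOT the kernel's struct ovl_entry. */
struct model_dentry {
	const struct model_dentry *parent; /* NULL for the overlay root */
	int top_lower_idx;       /* layer idx of lowerpath[0], 1 = topmost lower */
	int reconnect_layer_idx; /* highest layer with a non-indexed redirect
				    among the ancestors, as defined above */
};

/*
 * The condition proposed above: the dentry's uppermost lower layer is below
 * both its parent's uppermost lower layer and the reconnect layer.
 */
static bool model_encode_upper(const struct model_dentry *d)
{
	if (!d->parent)
		return false; /* assumption: the root is never marked */

	return d->top_lower_idx > d->parent->top_lower_idx &&
	       d->top_lower_idx > d->reconnect_layer_idx;
}
```

With the three-layer example (reconnect_layer_idx of 1 below /a, and
top_lower_idx of 1 for /a/b, 2 for /a/b/c and /a/b/c/d, 3 for /a/b/c/d/e and
/a/b/c/d/e/f), the function returns true only for /a/b/c and /a/b/c/d/e.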


This is my re-take on Documentation of this wrinkle:

When an overlay filesystem has multiple lower layers, a middle layer
directory may have a "redirect" to a lower directory.  Because middle layer
"redirects" are not indexed, a lower file handle that was encoded from the
"redirect" origin directory cannot be used to find the middle or upper
layer directory.  Similarly, a lower file handle that was encoded from a
descendant of the "redirect" origin directory cannot be used to
reconstruct a connected overlay path.  To mitigate the cases of
directories that cannot be decoded from a lower file handle, these
directories are copied up on encode and encoded as an upper file handle.

Let me know what you think.

Thanks,
Amir.
Patch

diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt
index 00e0595f3d7e..9e21c14c914c 100644
--- a/Documentation/filesystems/overlayfs.txt
+++ b/Documentation/filesystems/overlayfs.txt
@@ -315,6 +315,65 @@  origin file handle that was stored at copy_up time.  If a found lower
 directory does not match the stored origin, that directory will not be
 merged with the upper directory.
 
+
+NFS export
+----------
+
+When the underlying filesystems support NFS export and the "verify"
+feature is enabled, an overlay filesystem may be exported to NFS.
+
+With the "verify" feature, on copy_up of any lower object, an index
+entry is created under the index directory.  The index entry name is the
+hexadecimal representation of the copy up origin file handle.  For a
+non-directory object, the index entry is a hard link to the upper inode.
+For a directory object, the index entry has an extended attribute
+"trusted.overlay.origin" with an encoded file handle of the upper
+directory inode.
+
+When encoding a file handle from an overlay filesystem object, the
+following rules apply:
+
+1. For a non-upper object, encode a lower file handle from lower inode
+2. For an indexed object, encode a lower file handle from copy_up origin
+3. For a pure-upper object and for an existing non-indexed upper object,
+   encode an upper file handle from upper inode
+
+Encoding of a non-upper directory object is not supported when overlay
+filesystem has multiple lower layers.  In this case, the directory will
+be copied up first, and then encoded as an upper file handle.
+
+The encoded overlay file handle includes:
+ - Header including path type information (e.g. lower/upper)
+ - UUID of the underlying filesystem
+ - Underlying filesystem encoding of underlying inode
+
+This encoding is identical to the encoding of copy_up origin stored in
+"trusted.overlay.origin".
+
+When decoding an overlay file handle, the following steps are followed:
+
+1. Find underlying layer by UUID and path type information.
+2. Decode the underlying filesystem file handle to underlying dentry.
+3. For a lower file handle, lookup the handle in index directory by name.
+4. If a whiteout is found in index, return ESTALE. This represents an
+   overlay object that was deleted after its file handle was encoded.
+5. For a non-directory, instantiate a disconnected overlay dentry from the
+   decoded underlying dentry, the path type and index inode, if found.
+6. For a directory, use the connected underlying decoded dentry, path type
+   and index, to lookup a connected overlay dentry.
+
+The "verify" feature ensures that a decoded overlay directory object will
+be equivalent to the object that was used to encode the file handle.
+
+Decoding a non-directory file handle may return a disconnected dentry.
+copy_up of that disconnected dentry will create an upper index entry with
+no upper alias.
+
+The overlay filesystem does not support non-directory connectable file
+handles, so exporting with the 'subtree_check' exportfs configuration will
+cause failures to lookup files over NFS.
+
+
 Testsuite
 ---------
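
For readers following the decode steps in the patch, the control flow can be
sketched as a small C model.  Everything here is hypothetical and for
illustration only: struct model_fh, model_decode() and the enums are not
kernel interfaces, the outcome of each real lookup is passed in as a
pre-computed field, and returning -ESTALE when the layer or the underlying
handle cannot be resolved is an assumption (the patch only specifies ESTALE
for the whiteout case).

```c
#include <errno.h>
#include <stdbool.h>

/* What a lookup by name in the index directory would find. */
enum index_state { INDEX_NONE, INDEX_WHITEOUT, INDEX_FOUND };

/* Kind of overlay dentry produced by a successful decode. */
enum decode_result { DECODE_DISCONNECTED, DECODE_CONNECTED };

/* Hypothetical, pre-resolved view of a file handle; NOT a kernel struct. */
struct model_fh {
	bool is_upper;          /* path type from the handle header */
	bool is_dir;            /* handle was encoded from a directory */
	bool layer_found;       /* step 1: UUID and path type match a layer */
	bool real_decoded;      /* step 2: underlying fs decoded its handle */
	enum index_state index; /* step 3: result of the index lookup */
};

/* Returns 0 and sets *res on success, or -ESTALE on failure. */
static int model_decode(const struct model_fh *fh, enum decode_result *res)
{
	if (!fh->layer_found)	/* step 1 (error code is an assumption) */
		return -ESTALE;
	if (!fh->real_decoded)	/* step 2 (error code is an assumption) */
		return -ESTALE;

	/* Steps 3 and 4: only lower handles are looked up in the index. */
	if (!fh->is_upper && fh->index == INDEX_WHITEOUT)
		return -ESTALE;	/* object deleted after encode */

	/* Step 5: non-directories may be disconnected; step 6: dirs are not. */
	*res = fh->is_dir ? DECODE_CONNECTED : DECODE_DISCONNECTED;
	return 0;
}
```

The model deliberately ignores how the connected overlay dentry is looked up
for a directory; it only shows which handles are rejected and which kind of
dentry each remaining case yields.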