mbox series

[RFC,00/14] ceph+fscrypt: together at last (contexts and filenames)

Message ID 20200821182813.52570-1-jlayton@kernel.org (mailing list archive)
Headers show
Series ceph+fscrypt: together at last (contexts and filenames) | expand

Message

Jeff Layton Aug. 21, 2020, 6:27 p.m. UTC
This is a (very rough and incomplete) draft patchset that I've been
working on to add fscrypt support to cephfs. The main use case is being
able to allow encryption at the edges, without having to trust your storage
provider with keys.

Implementing fscrypt on a network filesystem has some challenges that
you don't have to deal with on a local fs:

Ceph (and most other netfs') will need to pre-create a crypto context
when creating a new inode as we'll need to encrypt some things before we
have an inode. This patchset stores contexts in an xattr, but that's
probably not ideal for the final implementation [1].

Storing a binary crypttext filename on the MDS (or most network
fileservers) may be problematic. We'll probably end up having to base64
encode the names when storing them. I expect most network filesystems to
have similar issues. That may limit the effective NAME_MAX for some
filesystems [2].

For content encryption, Ceph (and probably at least CIFS/SMB) will need
to deal with writes not aligned on crypto blocks. These filesystems
sometimes write synchronously to the server instead of going through the
pagecache [3].

Symlink handling in fscrypt will also need to be refactored a bit, as we
won't have an inode before we'll need to encrypt its contents.

This draft is _very_ rough and not ready for merge. This only covers the
context handling and filename encryption. It's missing a lot of stuff
still, but what's there basically works.

I'm mostly posting this now to get some early feedback on the basic idea
and approach. In particular, I'd appreciate some feedback from the
fscrypt maintainers. Please let me know if any of the changes I'm
proposing there look problematic.

Thanks for looking!
-- Jeff

[1]: We'll likely add a dedicated field to the standard on-the-wire
inode representation and in the MDS, but that's a separate sub-project.

[2]: For ceph, it looks like it's not a huge problem as the MDS doesn't
enforce filename lengths at all. Still, we may extend the protocol to
handle that better.

[3]: For Ceph, I think we'll be able to do a CMPEXT op to do a
read/modify/write-if-nothing-changed cycle for this case.

Jeff Layton (14):
  fscrypt: drop unused inode argument from fscrypt_fname_alloc_buffer
  fscrypt: add fscrypt_new_context_from_parent
  fscrypt: don't balk when inode is already marked encrypted
  fscrypt: export fscrypt_d_revalidate
  lib: lift fscrypt base64 conversion into lib/
  ceph: add fscrypt ioctls
  ceph: crypto context handling for ceph
  ceph: add routine to create context prior to RPC
  ceph: set S_ENCRYPTED bit if new inode has encryption.ctx xattr
  ceph: make ceph_msdc_build_path use ref-walk
  ceph: add encrypted fname handling to ceph_mdsc_build_path
  ceph: make d_revalidate call fscrypt revalidator for encrypted
    dentries
  ceph: add support to readdir for encrypted filenames
  ceph: add fscrypt support to ceph_fill_trace

 fs/ceph/Makefile        |   1 +
 fs/ceph/crypto.c        | 171 ++++++++++++++++++++++++++++++++++++++++
 fs/ceph/crypto.h        |  81 +++++++++++++++++++
 fs/ceph/dir.c           |  97 ++++++++++++++++++++---
 fs/ceph/file.c          |   4 +
 fs/ceph/inode.c         |  62 +++++++++++++--
 fs/ceph/ioctl.c         |  26 ++++++
 fs/ceph/mds_client.c    |  74 ++++++++++++-----
 fs/ceph/super.c         |  37 +++++++++
 fs/ceph/super.h         |  11 ++-
 fs/ceph/xattr.c         |  32 ++++++++
 fs/crypto/Kconfig       |   1 +
 fs/crypto/fname.c       |  67 +---------------
 fs/crypto/hooks.c       |   2 +-
 fs/crypto/keysetup.c    |   2 +-
 fs/crypto/policy.c      |  42 +++++++---
 fs/ext4/dir.c           |   2 +-
 fs/ext4/namei.c         |   7 +-
 fs/f2fs/dir.c           |   2 +-
 fs/ubifs/dir.c          |   2 +-
 include/linux/base64.h  |  11 +++
 include/linux/fscrypt.h |  12 ++-
 lib/Kconfig             |   3 +
 lib/Makefile            |   1 +
 lib/base64.c            |  71 +++++++++++++++++
 25 files changed, 696 insertions(+), 125 deletions(-)
 create mode 100644 fs/ceph/crypto.c
 create mode 100644 fs/ceph/crypto.h
 create mode 100644 include/linux/base64.h
 create mode 100644 lib/base64.c

Comments

Eric Biggers Aug. 22, 2020, 12:23 a.m. UTC | #1
Hi Jeff,

On Fri, Aug 21, 2020 at 02:27:59PM -0400, Jeff Layton wrote:
> This is a (very rough and incomplete) draft patchset that I've been
> working on to add fscrypt support to cephfs. The main use case is being
> able to allow encryption at the edges, without having to trust your storage
> provider with keys.

This is very interesting work -- thanks for sending this out!

> Implementing fscrypt on a network filesystem has some challenges that
> you don't have to deal with on a local fs:
> 
> Ceph (and most other netfs') will need to pre-create a crypto context
> when creating a new inode as we'll need to encrypt some things before we
> have an inode. This patchset stores contexts in an xattr, but that's
> probably not ideal for the final implementation [1].

Coincidentally, I've currently working on solving a similar problem.  On ext4,
the inode number can't be assigned, and the encryption xattr can't be set, until
the jbd2 transaction which creates the inode.  Also, if the new inode is a
symlink, then fscrypt_encrypt_symlink() has to be called during the transaction.
Together, these imply that fscrypt_get_encryption_info() has to be called during
the transaction.

That's what we do, currently.  However, it's technically wrong and can deadlock,
since fscrypt_get_encryption_info() isn't GFP_NOFS-safe (and it can't be).

f2fs appears to have a similar problem, though I'm still investigating.

To fix this, I'm planning to add new functions:

   - fscrypt_prepare_new_inode() will set up the fscrypt_info for a new
     'struct inode' which hasn't necessarily had an inode number assigned yet.
     It won't set the encryption xattr yet.

   - fscrypt_set_context() will set the encryption xattr, using the fscrypt_info
     that fscrypt_prepare_new_inode() created earlier.  It will replace
     fscrypt_inherit_context().

I'm working on my patches at
https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=wip-fscrypt.
They're not ready to send out yet, but I'll Cc you when I do.

It seems there's still something a bit different that you need: you want
fs/crypto/ to provide a buffer containing the encryption xattr (presumably
because ceph needs to package it up into a single network request that creates
the file?), instead of calling a function which then uses
fscrypt_operations::set_context().  I could pretty easily handle that by adding
a function that returns the xattr directly and would be an alternative to
fscrypt_set_context().

> Storing a binary crypttext filename on the MDS (or most network
> fileservers) may be problematic. We'll probably end up having to base64
> encode the names when storing them. I expect most network filesystems to
> have similar issues. That may limit the effective NAME_MAX for some
> filesystems [2].

I strongly recommend keeping support for the full NAME_MAX (255 bytes), if it's
at all possible.  eCryptfs limited filenames to 143 bytes, which has
historically caused lots of problems.  Try the following Google search to get a
sense of the large number of users that have run into this limitation:
https://www.google.com/search?q=ecryptfs+143+filename

> 
> For content encryption, Ceph (and probably at least CIFS/SMB) will need
> to deal with writes not aligned on crypto blocks. These filesystems
> sometimes write synchronously to the server instead of going through the
> pagecache [3].

I/O that isn't aligned to the "encryption data unit size" (which on the
filesystems that currently support fscrypt, is the same thing as the
"filesystem block size") isn't really possible unless it's buffered.  For
AES-XTS specifically, *in principle* it's possible to encrypt/decrypt an
individual 16-byte aligned region.  But Linux's crypto API doesn't currently
support sub-message crypto, and also fscrypt supports the AES-CBC and Adiantum
encryption modes which have stricter requirements.

So, I think that reads/writes to encrypted files will always need to go through
the page cache.

> 
> Symlink handling in fscrypt will also need to be refactored a bit, as we
> won't have an inode before we'll need to encrypt its contents.

Will there be an in-memory inode allocated yet (a 'struct inode'), just with no
inode number assigned yet?  If so, my work-in-progress patchset I mentioned
earlier should be sufficient to address this.  The order would be:

	1. fscrypt_prepare_new_inode()
	2. fscrypt_encrypt_symlink()
	3. Assign inode number

Or does ceph not have a 'struct inode' at all until step (3)?

- Eric
Jeff Layton Aug. 22, 2020, 12:58 a.m. UTC | #2
On Fri, 2020-08-21 at 17:23 -0700, Eric Biggers wrote:
> Hi Jeff,
> 
> On Fri, Aug 21, 2020 at 02:27:59PM -0400, Jeff Layton wrote:
> > This is a (very rough and incomplete) draft patchset that I've been
> > working on to add fscrypt support to cephfs. The main use case is being
> > able to allow encryption at the edges, without having to trust your storage
> > provider with keys.
> 
> This is very interesting work -- thanks for sending this out!
> 

Thanks. As I said, it's still very rough at this point, but I'm hoping
to have something ready for v5.11. I think we'll need to plumb some
support into the MDS too, so getting all of that lined up may take
longer.

> > Implementing fscrypt on a network filesystem has some challenges that
> > you don't have to deal with on a local fs:
> > 
> > Ceph (and most other netfs') will need to pre-create a crypto context
> > when creating a new inode as we'll need to encrypt some things before we
> > have an inode. This patchset stores contexts in an xattr, but that's
> > probably not ideal for the final implementation [1].
> 
> Coincidentally, I've currently working on solving a similar problem.  On ext4,
> the inode number can't be assigned, and the encryption xattr can't be set, until
> the jbd2 transaction which creates the inode.  Also, if the new inode is a
> symlink, then fscrypt_encrypt_symlink() has to be called during the transaction.
> Together, these imply that fscrypt_get_encryption_info() has to be called during
> the transaction.
> 

Yes, similar problem. I started looking at symlinks today, and got a
little ways into a patchset to refactor some fscrypt code to handle
them, but I don't think it's quite right yet. A more general solution
would be nice.

> That's what we do, currently.  However, it's technically wrong and can deadlock,
> since fscrypt_get_encryption_info() isn't GFP_NOFS-safe (and it can't be).
> 
> f2fs appears to have a similar problem, though I'm still investigating.
> 
> To fix this, I'm planning to add new functions:
> 
>    - fscrypt_prepare_new_inode() will set up the fscrypt_info for a new
>      'struct inode' which hasn't necessarily had an inode number assigned yet.
>      It won't set the encryption xattr yet.
> 

I more or less have that in 02/14, I think, but if you have something
else in mind, I'm happy to follow suit.

>    - fscrypt_set_context() will set the encryption xattr, using the fscrypt_info
>      that fscrypt_prepare_new_inode() created earlier.  It will replace
>      fscrypt_inherit_context().
> 

Ok. I don't think we'll need that for ceph, generally as we'll want to
send the context so it can be attached atomically during the create.

> I'm working on my patches at
> https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/log/?h=wip-fscrypt.
> They're not ready to send out yet, but I'll Cc you when I do.
> 

Thanks. I'll check them out.

> It seems there's still something a bit different that you need: you want
> fs/crypto/ to provide a buffer containing the encryption xattr (presumably
> because ceph needs to package it up into a single network request that creates
> the file?), instead of calling a function which then uses
> fscrypt_operations::set_context().  I could pretty easily handle that by adding
> a function that returns the xattr directly and would be an alternative to
> fscrypt_set_context().
> 

Exactly. We want to send the context with the create call.

> > Storing a binary crypttext filename on the MDS (or most network
> > fileservers) may be problematic. We'll probably end up having to base64
> > encode the names when storing them. I expect most network filesystems to
> > have similar issues. That may limit the effective NAME_MAX for some
> > filesystems [2].
> 
> I strongly recommend keeping support for the full NAME_MAX (255 bytes), if it's
> at all possible.  eCryptfs limited filenames to 143 bytes, which has
> historically caused lots of problems.  Try the following Google search to get a
> sense of the large number of users that have run into this limitation:
> https://www.google.com/search?q=ecryptfs+143+filename
> 

Absolutely. I don't think it'll be a problem with ceph as we have more
flexibility to change the protocol and underlying server
implementation. 

It would be really cool to weave this into NFS and CIFS/SMB eventually
too, but that might be more difficult as Linux has less control over
servers in the field, and most of them will cap out at 255 chars. Maybe
we could extend those protocols too though, if there were desire.

> > For content encryption, Ceph (and probably at least CIFS/SMB) will need
> > to deal with writes not aligned on crypto blocks. These filesystems
> > sometimes write synchronously to the server instead of going through the
> > pagecache [3].
> 
> I/O that isn't aligned to the "encryption data unit size" (which on the
> filesystems that currently support fscrypt, is the same thing as the
> "filesystem block size") isn't really possible unless it's buffered.  For
> AES-XTS specifically, *in principle* it's possible to encrypt/decrypt an
> individual 16-byte aligned region.  But Linux's crypto API doesn't currently
> support sub-message crypto, and also fscrypt supports the AES-CBC and Adiantum
> encryption modes which have stricter requirements.
> 
> So, I think that reads/writes to encrypted files will always need to go through
> the page cache.
> 

The ceph OSD protocol (which is the data path) can do cmpxchg-like
operations on an object. So, theoretically we can do a
read/modify/write, and the write would only work if the content hadn't
changed since the read. We'd then just redo the whole thing if there was
a conflicting change in the interim.

It'd be slow and would suck, but I think it would give us a correct
result. The good news is that usually the clients are granted the
capability to buffer up writes, at which point we can use the pagecache
like anything else. This would only be to ensure correctness in
contended cases.

> > Symlink handling in fscrypt will also need to be refactored a bit, as we
> > won't have an inode before we'll need to encrypt its contents.
> 
> Will there be an in-memory inode allocated yet (a 'struct inode'), just with no
> inode number assigned yet?  If so, my work-in-progress patchset I mentioned
> earlier should be sufficient to address this.  The order would be:
> 
> 	1. fscrypt_prepare_new_inode()
> 	2. fscrypt_encrypt_symlink()
> 	3. Assign inode number
> 
> 
> Or does ceph not have a 'struct inode' at all until step (3)?

No, generally ceph doesn't create an inode until the reply comes in. I
think we'll need to be able to create a context and encrypt the symlink
before we issue the call to the server. I started hacking at the fscrypt
code for this today, but I didn't get very far.

FWIW, ceph is a bit of an odd netfs protocol in that there is a standard
"trace" that holds info about dentries and inodes that are created or
modified as a result of an operation. Most of the dentry/inode cache
manipulation is done at that point, which is done as part of the reply
processing.

Thanks for the comments so far!
--
Jeff Layton <jlayton@kernel.org>
Eric Biggers Aug. 22, 2020, 2:34 a.m. UTC | #3
On Fri, Aug 21, 2020 at 08:58:35PM -0400, Jeff Layton wrote:
> > > Ceph (and most other netfs') will need to pre-create a crypto context
> > > when creating a new inode as we'll need to encrypt some things before we
> > > have an inode. This patchset stores contexts in an xattr, but that's
> > > probably not ideal for the final implementation [1].
> > 
> > Coincidentally, I've currently working on solving a similar problem.  On ext4,
> > the inode number can't be assigned, and the encryption xattr can't be set, until
> > the jbd2 transaction which creates the inode.  Also, if the new inode is a
> > symlink, then fscrypt_encrypt_symlink() has to be called during the transaction.
> > Together, these imply that fscrypt_get_encryption_info() has to be called during
> > the transaction.
> > 
> 
> Yes, similar problem. I started looking at symlinks today, and got a
> little ways into a patchset to refactor some fscrypt code to handle
> them, but I don't think it's quite right yet. A more general solution
> would be nice.
> 
> > That's what we do, currently.  However, it's technically wrong and can deadlock,
> > since fscrypt_get_encryption_info() isn't GFP_NOFS-safe (and it can't be).
> > 
> > f2fs appears to have a similar problem, though I'm still investigating.
> > 
> > To fix this, I'm planning to add new functions:
> > 
> >    - fscrypt_prepare_new_inode() will set up the fscrypt_info for a new
> >      'struct inode' which hasn't necessarily had an inode number assigned yet.
> >      It won't set the encryption xattr yet.
> > 
> 
> I more or less have that in 02/14, I think, but if you have something
> else in mind, I'm happy to follow suit.
[...]
> > > Symlink handling in fscrypt will also need to be refactored a bit, as we
> > > won't have an inode before we'll need to encrypt its contents.
> > 
> > Will there be an in-memory inode allocated yet (a 'struct inode'), just with no
> > inode number assigned yet?  If so, my work-in-progress patchset I mentioned
> > earlier should be sufficient to address this.  The order would be:
> > 
> > 	1. fscrypt_prepare_new_inode()
> > 	2. fscrypt_encrypt_symlink()
> > 	3. Assign inode number
> > 
> > 
> > Or does ceph not have a 'struct inode' at all until step (3)?
> 
> No, generally ceph doesn't create an inode until the reply comes in. I
> think we'll need to be able to create a context and encrypt the symlink
> before we issue the call to the server. I started hacking at the fscrypt
> code for this today, but I didn't get very far.
> 
> FWIW, ceph is a bit of an odd netfs protocol in that there is a standard
> "trace" that holds info about dentries and inodes that are created or
> modified as a result of an operation. Most of the dentry/inode cache
> manipulation is done at that point, which is done as part of the reply
> processing.

Your patch "fscrypt: add fscrypt_new_context_from_parent" takes in a directory
and generates an fscrypt_context (a.k.a. an encryption xattr) for a new file
that will be created in that directory.

fscrypt_prepare_new_inode() from my work-in-progress patches would do a bit more
than that.  It would actually set up a "struct fscrypt_info" for a new inode.
That includes the encryption key and all information needed to build the
fscrypt_context.  So, afterwards it will be possible to call
fscrypt_encrypt_symlink() before the fscrypt_context is "saved to disk".
IIUC, that's part of what ceph will need.

The catch is that there will still have to be a 'struct inode' to associate the
'struct fscrypt_info' with.  It won't have to have ->i_ino set yet, but some
other fields (at least ->i_mode and ->i_sb) will have to be set, since lots of
code in fs/crypto/ uses those fields.

I think it would be possible to refactor things to make 'struct fscrypt_info'
more separate from 'struct inode', so that filesystems could create a
'struct fscrypt_info' that isn't associated with an inode yet, then encrypt a
symlink target using it (not caching it in ->i_link as we currently do).

However, it would require a lot of changes.

So I'm wondering if it would be easier to instead change ceph to create and
start initializing the 'struct inode' earlier.  It doesn't have to have an inode
number assigned or be added to the inode cache yet; it just needs to be
allocated in memory and some basic fields need to be initialized.  In theory
it's possible, right?  I'd expect that local filesystems aren't even that much
different, in principle; they start initializing a new 'struct inode' in memory
first, and only later do they *really* create the inode by allocating an inode
number and saving the changes to disk.

- Eric
Jeff Layton Aug. 24, 2020, 12:03 p.m. UTC | #4
On Fri, 2020-08-21 at 19:34 -0700, Eric Biggers wrote:
> On Fri, Aug 21, 2020 at 08:58:35PM -0400, Jeff Layton wrote:
> > > > Ceph (and most other netfs') will need to pre-create a crypto context
> > > > when creating a new inode as we'll need to encrypt some things before we
> > > > have an inode. This patchset stores contexts in an xattr, but that's
> > > > probably not ideal for the final implementation [1].
> > > 
> > > Coincidentally, I've currently working on solving a similar problem.  On ext4,
> > > the inode number can't be assigned, and the encryption xattr can't be set, until
> > > the jbd2 transaction which creates the inode.  Also, if the new inode is a
> > > symlink, then fscrypt_encrypt_symlink() has to be called during the transaction.
> > > Together, these imply that fscrypt_get_encryption_info() has to be called during
> > > the transaction.
> > > 
> > 
> > Yes, similar problem. I started looking at symlinks today, and got a
> > little ways into a patchset to refactor some fscrypt code to handle
> > them, but I don't think it's quite right yet. A more general solution
> > would be nice.
> > 
> > > That's what we do, currently.  However, it's technically wrong and can deadlock,
> > > since fscrypt_get_encryption_info() isn't GFP_NOFS-safe (and it can't be).
> > > 
> > > f2fs appears to have a similar problem, though I'm still investigating.
> > > 
> > > To fix this, I'm planning to add new functions:
> > > 
> > >    - fscrypt_prepare_new_inode() will set up the fscrypt_info for a new
> > >      'struct inode' which hasn't necessarily had an inode number assigned yet.
> > >      It won't set the encryption xattr yet.
> > > 
> > 
> > I more or less have that in 02/14, I think, but if you have something
> > else in mind, I'm happy to follow suit.
> [...]
> > > > Symlink handling in fscrypt will also need to be refactored a bit, as we
> > > > won't have an inode before we'll need to encrypt its contents.
> > > 
> > > Will there be an in-memory inode allocated yet (a 'struct inode'), just with no
> > > inode number assigned yet?  If so, my work-in-progress patchset I mentioned
> > > earlier should be sufficient to address this.  The order would be:
> > > 
> > > 	1. fscrypt_prepare_new_inode()
> > > 	2. fscrypt_encrypt_symlink()
> > > 	3. Assign inode number
> > > 
> > > 
> > > Or does ceph not have a 'struct inode' at all until step (3)?
> > 
> > No, generally ceph doesn't create an inode until the reply comes in. I
> > think we'll need to be able to create a context and encrypt the symlink
> > before we issue the call to the server. I started hacking at the fscrypt
> > code for this today, but I didn't get very far.
> > 
> > FWIW, ceph is a bit of an odd netfs protocol in that there is a standard
> > "trace" that holds info about dentries and inodes that are created or
> > modified as a result of an operation. Most of the dentry/inode cache
> > manipulation is done at that point, which is done as part of the reply
> > processing.
> 
> Your patch "fscrypt: add fscrypt_new_context_from_parent" takes in a directory
> and generates an fscrypt_context (a.k.a. an encryption xattr) for a new file
> that will be created in that directory.
> 
> fscrypt_prepare_new_inode() from my work-in-progress patches would do a bit more
> than that.  It would actually set up a "struct fscrypt_info" for a new inode.
> That includes the encryption key and all information needed to build the
> fscrypt_context.  So, afterwards it will be possible to call
> fscrypt_encrypt_symlink() before the fscrypt_context is "saved to disk".
> IIUC, that's part of what ceph will need.
> 
> The catch is that there will still have to be a 'struct inode' to associate the
> 'struct fscrypt_info' with.  It won't have to have ->i_ino set yet, but some
> other fields (at least ->i_mode and ->i_sb) will have to be set, since lots of
> code in fs/crypto/ uses those fields.
> 
> I think it would be possible to refactor things to make 'struct fscrypt_info'
> more separate from 'struct inode', so that filesystems could create a
> 'struct fscrypt_info' that isn't associated with an inode yet, then encrypt a
> symlink target using it (not caching it in ->i_link as we currently do).
> 
> However, it would require a lot of changes.
> 
> So I'm wondering if it would be easier to instead change ceph to create and
> start initializing the 'struct inode' earlier.  It doesn't have to have an inode
> number assigned or be added to the inode cache yet; it just needs to be
> allocated in memory and some basic fields need to be initialized.  In theory
> it's possible, right?  I'd expect that local filesystems aren't even that much
> different, in principle; they start initializing a new 'struct inode' in memory
> first, and only later do they *really* create the inode by allocating an inode
> number and saving the changes to disk.
> 

It's probably possible. I think we'd just need to attach the nascent
inode to the MDS request tracking structure, and convert that from using
iget5_locked to inode_insert5.

Would we need to do this for all inode types or just symlinks?
Eric Biggers Aug. 24, 2020, 4:55 p.m. UTC | #5
On Mon, Aug 24, 2020 at 08:03:35AM -0400, Jeff Layton wrote:
> On Fri, 2020-08-21 at 19:34 -0700, Eric Biggers wrote:
> > On Fri, Aug 21, 2020 at 08:58:35PM -0400, Jeff Layton wrote:
> > > > > Ceph (and most other netfs') will need to pre-create a crypto context
> > > > > when creating a new inode as we'll need to encrypt some things before we
> > > > > have an inode. This patchset stores contexts in an xattr, but that's
> > > > > probably not ideal for the final implementation [1].
> > > > 
> > > > Coincidentally, I've currently working on solving a similar problem.  On ext4,
> > > > the inode number can't be assigned, and the encryption xattr can't be set, until
> > > > the jbd2 transaction which creates the inode.  Also, if the new inode is a
> > > > symlink, then fscrypt_encrypt_symlink() has to be called during the transaction.
> > > > Together, these imply that fscrypt_get_encryption_info() has to be called during
> > > > the transaction.
> > > > 
> > > 
> > > Yes, similar problem. I started looking at symlinks today, and got a
> > > little ways into a patchset to refactor some fscrypt code to handle
> > > them, but I don't think it's quite right yet. A more general solution
> > > would be nice.
> > > 
> > > > That's what we do, currently.  However, it's technically wrong and can deadlock,
> > > > since fscrypt_get_encryption_info() isn't GFP_NOFS-safe (and it can't be).
> > > > 
> > > > f2fs appears to have a similar problem, though I'm still investigating.
> > > > 
> > > > To fix this, I'm planning to add new functions:
> > > > 
> > > >    - fscrypt_prepare_new_inode() will set up the fscrypt_info for a new
> > > >      'struct inode' which hasn't necessarily had an inode number assigned yet.
> > > >      It won't set the encryption xattr yet.
> > > > 
> > > 
> > > I more or less have that in 02/14, I think, but if you have something
> > > else in mind, I'm happy to follow suit.
> > [...]
> > > > > Symlink handling in fscrypt will also need to be refactored a bit, as we
> > > > > won't have an inode before we'll need to encrypt its contents.
> > > > 
> > > > Will there be an in-memory inode allocated yet (a 'struct inode'), just with no
> > > > inode number assigned yet?  If so, my work-in-progress patchset I mentioned
> > > > earlier should be sufficient to address this.  The order would be:
> > > > 
> > > > 	1. fscrypt_prepare_new_inode()
> > > > 	2. fscrypt_encrypt_symlink()
> > > > 	3. Assign inode number
> > > > 
> > > > 
> > > > Or does ceph not have a 'struct inode' at all until step (3)?
> > > 
> > > No, generally ceph doesn't create an inode until the reply comes in. I
> > > think we'll need to be able to create a context and encrypt the symlink
> > > before we issue the call to the server. I started hacking at the fscrypt
> > > code for this today, but I didn't get very far.
> > > 
> > > FWIW, ceph is a bit of an odd netfs protocol in that there is a standard
> > > "trace" that holds info about dentries and inodes that are created or
> > > modified as a result of an operation. Most of the dentry/inode cache
> > > manipulation is done at that point, which is done as part of the reply
> > > processing.
> > 
> > Your patch "fscrypt: add fscrypt_new_context_from_parent" takes in a directory
> > and generates an fscrypt_context (a.k.a. an encryption xattr) for a new file
> > that will be created in that directory.
> > 
> > fscrypt_prepare_new_inode() from my work-in-progress patches would do a bit more
> > than that.  It would actually set up a "struct fscrypt_info" for a new inode.
> > That includes the encryption key and all information needed to build the
> > fscrypt_context.  So, afterwards it will be possible to call
> > fscrypt_encrypt_symlink() before the fscrypt_context is "saved to disk".
> > IIUC, that's part of what ceph will need.
> > 
> > The catch is that there will still have to be a 'struct inode' to associate the
> > 'struct fscrypt_info' with.  It won't have to have ->i_ino set yet, but some
> > other fields (at least ->i_mode and ->i_sb) will have to be set, since lots of
> > code in fs/crypto/ uses those fields.
> > 
> > I think it would be possible to refactor things to make 'struct fscrypt_info'
> > more separate from 'struct inode', so that filesystems could create a
> > 'struct fscrypt_info' that isn't associated with an inode yet, then encrypt a
> > symlink target using it (not caching it in ->i_link as we currently do).
> > 
> > However, it would require a lot of changes.
> > 
> > So I'm wondering if it would be easier to instead change ceph to create and
> > start initializing the 'struct inode' earlier.  It doesn't have to have an inode
> > number assigned or be added to the inode cache yet; it just needs to be
> > allocated in memory and some basic fields need to be initialized.  In theory
> > it's possible, right?  I'd expect that local filesystems aren't even that much
> > different, in principle; they start initializing a new 'struct inode' in memory
> > first, and only later do they *really* create the inode by allocating an inode
> > number and saving the changes to disk.
> > 
> 
> It's probably possible. I think we'd just need to attach the nascent
> inode to the MDS request tracking structure, and convert that from using
> iget5_locked to inode_insert5.
> 
> Would we need to do this for all inode types or just symlinks?

It would be all inodes, since fscrypt_prepare_new_inode() will handle all types
of inodes.

- Eric