
[v1,00/12] fscrypt: add extent encryption

Message ID cover.1687988246.git.sweettea-kernel@dorminy.me (mailing list archive)

Message

Sweet Tea Dorminy June 29, 2023, 12:29 a.m. UTC
This changeset adds extent-based data encryption to fscrypt.
Some filesystems need to encrypt data based on extents, rather than on
inodes, due to features incompatible with inode-based encryption. For
instance, btrfs can have multiple inodes referencing a single block of
data, and moves logical data blocks to different physical locations on
disk in the background. 

As per discussion last year in [1] and later in [2], we would like to
allow the use of fscrypt with btrfs, with authenticated encryption. This
is the first step of that work, adding extent-based encryption to
fscrypt; authenticated encryption is the next step. Extent-based
encryption should be usable by other filesystems which wish to support
snapshotting or background data rearrangement also, but btrfs is the
first user. 

This changeset requires extent encryption to use inlinecrypt, as
discussed previously. There are two questionable parts: the
forget_extent_info hook is not yet in use by btrfs, as I haven't yet
written a test exercising a race where it would be relevant; and saving
the session key credentials just to enable v1 session-based policies is
perhaps less good than 

This applies atop [3], which itself is based on kdave/misc-next. It
passes most encryption fstests with suitable changes to btrfs-progs, but
not generic/580 or generic/595 due to different timing involved in
extent encryption. Tests and btrfs progs updates to follow.


[1] https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing
[2] https://lore.kernel.org/linux-fscrypt/80496cfe-161d-fb0d-8230-93818b966b1b@dorminy.me/T/#t
[3]
https://lore.kernel.org/linux-fscrypt/cover.1687988119.git.sweettea-kernel@dorminy.me/

Sweet Tea Dorminy (12):
  fscrypt: factor helper for locking master key
  fscrypt: factor getting info for a specific block
  fscrypt: adjust effective lblks based on extents
  fscrypt: add a super_block pointer to fscrypt_info
  fscrypt: setup leaf inodes for extent encryption
  fscrypt: allow infos to be owned by extents
  fscrypt: notify per-extent infos if master key vanishes
  fscrypt: use an optional ino equivalent for per-extent infos
  fscrypt: add creation/usage/freeing of per-extent infos
  fscrypt: allow load/save of extent contexts
  fscrypt: save session key credentials for extent infos
  fscrypt: update documentation for per-extent keys

 Documentation/filesystems/fscrypt.rst |  38 +++-
 fs/crypto/crypto.c                    |   6 +-
 fs/crypto/fscrypt_private.h           |  91 ++++++++++
 fs/crypto/inline_crypt.c              |  28 ++-
 fs/crypto/keyring.c                   |  32 +++-
 fs/crypto/keysetup.c                  | 244 ++++++++++++++++++++++----
 fs/crypto/keysetup_v1.c               |   7 +-
 fs/crypto/policy.c                    |  20 +++
 include/linux/fscrypt.h               |  74 ++++++++
 9 files changed, 480 insertions(+), 60 deletions(-)


base-commit: accadeb67609a5a5d088ebde8409c3f6db0b84b4

Comments

Eric Biggers July 3, 2023, 4:54 a.m. UTC | #1
Hi Sweet Tea,

On Wed, Jun 28, 2023 at 08:29:30PM -0400, Sweet Tea Dorminy wrote:
> This changeset adds extent-based data encryption to fscrypt.
> Some filesystems need to encrypt data based on extents, rather than on
> inodes, due to features incompatible with inode-based encryption. For
> instance, btrfs can have multiple inodes referencing a single block of
> data, and moves logical data blocks to different physical locations on
> disk in the background. 
> 
> As per discussion last year in [1] and later in [2], we would like to
> allow the use of fscrypt with btrfs, with authenticated encryption. This
> is the first step of that work, adding extent-based encryption to
> fscrypt; authenticated encryption is the next step. Extent-based
> encryption should be usable by other filesystems which wish to support
> snapshotting or background data rearrangement also, but btrfs is the
> first user. 
> 
> This changeset requires extent encryption to use inlinecrypt, as
> discussed previously. There are two questionable parts: the
> forget_extent_info hook is not yet in use by btrfs, as I haven't yet
> written a test exercising a race where it would be relevant; and saving
> the session key credentials just to enable v1 session-based policies is
> perhaps less good than 
> 
> This applies atop [3], which itself is based on kdave/misc-next. It
> passes most encryption fstests with suitable changes to btrfs-progs, but
> not generic/580 or generic/595 due to different timing involved in
> extent encryption. Tests and btrfs progs updates to follow.
> 
> 
> [1] https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing
> [2] https://lore.kernel.org/linux-fscrypt/80496cfe-161d-fb0d-8230-93818b966b1b@dorminy.me/T/#t
> [3]
> https://lore.kernel.org/linux-fscrypt/cover.1687988119.git.sweettea-kernel@dorminy.me/
> 
> Sweet Tea Dorminy (12):
>   fscrypt: factor helper for locking master key
>   fscrypt: factor getting info for a specific block
>   fscrypt: adjust effective lblks based on extents
>   fscrypt: add a super_block pointer to fscrypt_info
>   fscrypt: setup leaf inodes for extent encryption
>   fscrypt: allow infos to be owned by extents
>   fscrypt: notify per-extent infos if master key vanishes
>   fscrypt: use an optional ino equivalent for per-extent infos
>   fscrypt: add creation/usage/freeing of per-extent infos
>   fscrypt: allow load/save of extent contexts
>   fscrypt: save session key credentials for extent infos
>   fscrypt: update documentation for per-extent keys
> 
>  Documentation/filesystems/fscrypt.rst |  38 +++-
>  fs/crypto/crypto.c                    |   6 +-
>  fs/crypto/fscrypt_private.h           |  91 ++++++++++
>  fs/crypto/inline_crypt.c              |  28 ++-
>  fs/crypto/keyring.c                   |  32 +++-
>  fs/crypto/keysetup.c                  | 244 ++++++++++++++++++++++----
>  fs/crypto/keysetup_v1.c               |   7 +-
>  fs/crypto/policy.c                    |  20 +++
>  include/linux/fscrypt.h               |  74 ++++++++
>  9 files changed, 480 insertions(+), 60 deletions(-)
> 
> 
> base-commit: accadeb67609a5a5d088ebde8409c3f6db0b84b4

Thanks for sending this out!

It's going to take me a while to go through everything, so please bear with me.
In general I'd also really like to be seeing more feedback from the other btrfs
developers.  This is a hard project that really needs more eyes on it.

From a brief look through your patchsets, there's one thing I want to bring up
right away.  It seems that one important design choice that you've made that has
impacted much of your patchsets is that you've made each extent a fully
standalone thing, similar to inodes currently.  I.e.,

    (a) Each extent gets a full 'fscrypt_context' stored along with it.  That
        includes not just the nonce, but also the encryption modes and master
        key identifier.

    (b) For runtime caching, each extent gets a full 'struct fscrypt_info'
        object.  It doesn't "belong" to any inode; it's set up in a fully
        standalone way, and the master key lookup and removal logic operates
        directly on the extent's 'struct fscrypt_info'.

I'm not sure this is a good idea.  What I had thought it was going to look like
is that the encryption context/policy and 'struct fscrypt_info' would stay a
property of the inode, and the extents themselves would be much more lightweight
-- both on disk and in the cache.  On-disk, all that should really be needed for
an extent is the nonce for deriving the per-extent key.  And in-memory, all that
should really be needed is a "fscrypt_prepared_key" for the per-extent key, and
a reference to the owning inode.
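[Editor's illustration] The on-disk cost difference between the two designs can be made concrete with a quick tally, assuming a per-extent context would reuse the field layout of fscrypt's existing v2 context (an assumption for illustration; the patchset's actual layout may differ):

```python
# Bytes stored per extent under the two designs, using the field sizes
# of fscrypt's struct fscrypt_context_v2 as the reference layout.
V2_CONTEXT_FIELDS = {
    "version": 1,
    "contents_encryption_mode": 1,
    "filenames_encryption_mode": 1,
    "flags": 1,
    "reserved": 4,
    "master_key_identifier": 16,
    "nonce": 16,
}

# "Heavyweight": the whole context travels with each extent.
full = sum(V2_CONTEXT_FIELDS.values())

# "Lightweight": only the nonce lives with the extent; modes, flags,
# and the master key identifier come from the owning inode's policy.
lean = V2_CONTEXT_FIELDS["nonce"]

print(full, lean)  # 40 16
```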

I think that would avoid many of the problems that it seems you've had to work
around or have had to change user-visible semantics for.  For example the
problems involving master keys being added and removed.  It would also avoid
having to overload 'fscrypt_info' to be either a per-inode or a per-extent key.
And it would save space on disk and in memory.

Can you elaborate on why you went with a more "heavyweight" extents design?

Maybe your motivation is that extents can be referenced by more than one inode
and thus do not have a unique owning inode?  That's true, but I don't think that
really matters.  All the inodes that reference an extent will have the same
encryption policy, right?  Also, it looks like the "struct extent_map" that
you're caching the per-extent key in is already cached on a per-inode basis, in
btrfs_inode::extent_tree, similar to the pagecache which is also per-inode.  So
if the same extent happens to be accessed via multiple inodes, that's still
going to cause the fscrypt key to be set up twice anyway.

- Eric
Sweet Tea Dorminy July 3, 2023, 5:06 p.m. UTC | #2
>> base-commit: accadeb67609a5a5d088ebde8409c3f6db0b84b4
> 
> Thanks for sending this out!
> 
> It's going to take me a while to go through everything, so please bear with me.
> In general I'd also really like to be seeing more feedback from the other btrfs
> developers.  This is a hard project that really needs more eyes on it.

I appreciate your time on it!
> 
>  From a brief look through your patchsets, there's one thing I want to bring up
> right away.  It seems that one important design choice that you've made that has
> impacted much of your patchsets is that you've made each extent a fully
> standalone thing, similar to inodes currently.  I.e.,
> 
>      (a) Each extent gets a full 'fscrypt_context' stored along with it.  That
>          includes not just the nonce, but also the encryption modes and master
>          key identifier.
> 
>      (b) For runtime caching, each extent gets a full 'struct fscrypt_info'
>          object.  It doesn't "belong" to any inode; it's set up in a fully
>          standalone way, and the master key lookup and removal logic operates
>          directly on the extent's 'struct fscrypt_info'.
> 
> I'm not sure this is a good idea.  What I had thought it was going to look like
> is that the encryption context/policy and 'struct fscrypt_info' would stay a
> property of the inode, and the extents themselves would be much more lightweight
> -- both on disk and in the cache.  On-disk, all that should really be needed for
> an extent is the nonce for deriving the per-extent key.  And in-memory, all that
> should really be needed is a "fscrypt_prepared_key" for the per-extent key, and
> a reference to the owning inode.
 >

The in-memory reduction is plausible. For extents that are in memory but 
not yet written to disk, we need some way to keep track of the context, 
but we could drop the nonce/policy after that. I was aiming to have the 
same structure so that there's maximal similarity in info creation and 
things like fscrypt_generate_iv would always be getting an info, 
regardless of inode vs extent, but we could throw a conditional in there 
and create a different structure for in-memory extent infos.

However, it seems like an extent and a leaf inode in inode fscrypt need 
the same information, so if splitting the fscrypt_info structure makes 
sense, maybe it should be on that boundary?

> 
> I think that would avoid many of the problems that it seems you've had to work
> around or have had to change user-visible semantics for.  For example the
> problems involving master keys being added and removed.  It would also avoid
> having to overload 'fscrypt_info' to be either a per-inode or a per-extent key.
> And it would save space on disk and in memory.

I might be misunderstanding what you're referencing, but I think you're 
talking about the change where, with extent fscrypt, IO has to be forced 
down before removing a key, otherwise it is lost. I think that's a 
fundamental problem: the filesystem has no way to know that there are 
new, dirty pages in the pagecache until those pages are issued for 
write, so it can't create a new extent (or several) until that point, 
potentially after the relevant key has been evicted. But maybe I'm 
missing a hook that would let us create extents earlier.

I suppose we could give each leaf inode a proper nonce/prepared key 
instead of borrowing its parent dir's: if a write came in after the key 
was removed but the inode is still open, the new extent(s) could grab 
the key material out of the inode's info. I don't like this very much 
since it could result in multiple extents grabbing the same key 
material, but I suppose it could work if it's important to maintain that 
behavior.
> 
> Can you elaborate on why you went with a more "heavyweight" extents design?
Being able to rekey a directory is the reason for having full contexts: 
suppose I take a snapshot of an encrypted dir and want to change the key 
for new data going forward, to avoid using a single key on too much 
data. It's too expensive to reencrypt every extent with the new key, 
since the whole point of a snapshot is to make a lightweight copy that 
gets COWed on write. Then each extent needs to know what its own master 
key identifier/policy/flags are, since different extents in the same 
file could have different master keys. We could say the mode and flags 
have to match, but it doesn't seem to me worth saving seven bytes to add 
a new structure to just store the master key identifier and nonce.

For a non-Meta usecase, from what I've heard from Fedora-land, it's 
possibly interesting to them to be able to ship an encrypted image, and 
then be able to change the key after encrypted install to something 
user-controlled.

Without rekeying, my understanding is that we may write too much data 
with one key for safety; notes in the updated design doc 
https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing 
are that writing more than 1P per key raises cryptographic concerns, and 
since btrfs is COW and could have volumes up to the full 16E size that 
btrfs supports, we don't want to have just one immutable key per subvol.
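[Editor's illustration] The arithmetic behind that concern, reading "1P" and "16E" as binary units per the design doc:

```python
# Back-of-the-envelope: how many keys does a maximally sized btrfs
# volume need if each key should encrypt no more than ~1P of data?
PIB = 2 ** 50          # "1P": suggested per-key data cap
EIB = 2 ** 60
max_volume = 16 * EIB  # "16E": btrfs's maximum volume size
min_keys = max_volume // PIB
print(min_keys)  # 16384
```

So a full-size COW volume with a single immutable key would exceed the per-key cap by four orders of magnitude, which is the motivation for rekeying.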

To me the lightweight-on-disk vision sounds a lot like the original 
design: 
https://lore.kernel.org/linux-btrfs/YXGyq+buM79A1S0L@relinquished.localdomain 
and the Nov '22 version of the patchset: 
https://lore.kernel.org/linux-btrfs/cover.1667389115.git.sweettea-kernel@dorminy.me/ 
(which didn't have rekeying). I think rekeying is worth the higher disk 
usage; but I'm probably missing something about how your vision differs 
from the original. Could you please take a look at it again?

> 
> Maybe your motivation is that extents can be referenced by more than one inode
> and thus do not have a unique owning inode?  That's true, but I don't think that
> really matters.  All the inodes that reference an extent will have the same
> encryption policy, right? 
As above, not necessarily.

> Also, it looks like the "struct extent_map" that
> you're caching the per-extent key in is already cached on a per-inode basis, in
> btrfs_inode::extent_tree, similar to the pagecache which is also per-inode.  So
> if the same extent happens to be accessed via multiple inodes, that's still
> going to cause the fscrypt key to be set up twice anyway.

A good point, and if you want me to take advantage of the 
one-copy-per-inode fact for general extent-based fscrypt I can do so.

Many thanks!

Sweet Tea
Eric Biggers July 3, 2023, 6:17 p.m. UTC | #3
On Mon, Jul 03, 2023 at 01:06:17PM -0400, Sweet Tea Dorminy wrote:
> > 
> > I think that would avoid many of the problems that it seems you've had to work
> > around or have had to change user-visible semantics for.  For example the
> > problems involving master keys being added and removed.  It would also avoid
> > having to overload 'fscrypt_info' to be either a per-inode or a per-extent key.
> > And it would save space on disk and in memory.
> 
> I might be misunderstanding what you're referencing, but I think you're
> talking about the change where with extent fscrypt, IO has to be forced down
> before removing a key, otherwise it is lost. I think that's a fundamental
> problem given the filesystem has no way to know that there are new, dirty
> pages in the pagecache until those pages are issued for write, so it can't
> create a new extent or few until that point, potentially after the relevant
> key has been evicted. But maybe I'm missing a hook that would let us make
> extents earlier.
> 
> I suppose we could give each leaf inode a proper nonce/prepared key instead
> of borrowing its parent dir's: if a write came in after the key was removed
> but the inode is still open, the new extent(s) could grab the key material
> out of the inode's info. I don't like this very much since it could result
> in multiple extents grabbing the same key material, but I suppose it could
> work if it's important to maintain that behavior.

Right, if extent keys are derived directly from the master key, and
FS_IOC_REMOVE_ENCRYPTION_KEY is executed which causes the master key secret to
be wiped, then no more extent keys can be derived.

So I see why you implemented the behavior you did.  It does seem a bit
dangerous, though.  If I understand correctly, under your proposal, if
FS_IOC_REMOVE_ENCRYPTION_KEY is executed while a process is doing buffered
writes to an encrypted file protected by that key, then past writes will get
synced out just before the key removal, but future writes will just be silently
thrown away.  (Remember, writes in Linux are asynchronous; processes don't get
informed of write errors unless they call fsync() or are doing direct I/O.)

I wonder if we should just keep the master key around, for per-extent key
derivation only, until all files using that master key have been closed.  That
would minimize the changes from the current fscrypt semantics.

> > Can you elaborate on why you went with a more "heavyweight" extents design?
> Being able to rekey a directory is the reason for having full contexts:
> suppose I take a snapshot of an encrypted dir and want to change the key for
> new data going forward, to avoid using a single key on too much data. It's
> too expensive to reencrypt every extent with the new key, since the whole
> point of a snapshot is to make a lightweight copy that gets COWed on write.
> Then each extent needs to know what its own master key
> identifier/policy/flags are, since different extents in the same file could
> have different master keys. We could say the mode and flags have to match,
> but it doesn't seem to me worth saving seven bytes to add a new structure to
> just store the master key identifier and nonce.
> 
> For a non-Meta usecase, from what I've heard from Fedora-land, it's possibly
> interesting to them to be able to ship an encrypted image, and then be able
> to change the key after encrypted install to something user-controlled.
> 
> Without rekeying, my understanding is that we may write too much data with
> one key for safety; notes in the updated design doc https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing
> are that writing more than 1P per key raises cryptographic concerns, and
> since btrfs is COW and could have volumes up to the full 16E size that btrfs
> supports, we don't want to have just one immutable key per subvol.
> 
> To me the lightweight-on-disk vision sounds a lot like the original design:
> https://lore.kernel.org/linux-btrfs/YXGyq+buM79A1S0L@relinquished.localdomain
> and the Nov '22 version of the patchset: https://lore.kernel.org/linux-btrfs/cover.1667389115.git.sweettea-kernel@dorminy.me/
> (which didn't have rekeying). I think rekeying is worth the higher disk
> usage; but I'm probably missing something about how your vision differs from
> the original. Could you please take a look at it again?

Your original design didn't use per-extent keys, but rather had a single
contents encryption key per master key.  We had discussed that that approach had
multiple disadvantages, one of which is that on btrfs it can run uncomfortably
close to the cryptographic limits for the contents encryption modes such as
AES-XTS.  So we decided to go with per-extent keys instead.

I don't think we discussed cryptographic limits on the master key itself.  That
is actually much less of a concern, as the master key is just used for
HKDF-SHA512.  I don't think there is any real need to ever change the master
key.  (Well, if it is compromised, it could be needed, but that's not really
relevant here.  If that happens you'd need to re-encrypt everything anyway.)

I do recall some discussion of making it possible to set an encryption policy on
an *unencrypted* directory, causing new files in that directory to be encrypted.
However, I don't recall any discussion of making it possible to add another
encryption policy to an *already-encrypted* directory.  I think this is the
first time this has been brought up.

I think that allowing directories to have multiple encryption policies would
bring in a lot of complexity.  How would it be configured, and what would the
semantics for accessing it be?  Where would the encryption policies be stored?
What if you have added some of the keys but not all of them?  What if some of
the keys get removed but not all of them?

Can you elaborate more on why you want this?  I was already a bit concerned
about the plan for making it possible to set an encryption policy on an
unencrypted directory, as that already diverges from the existing fscrypt
semantics.  But now it sounds like the scope has grown even more.

Keep in mind that in general, the closer we are able to stick to the existing
fscrypt semantics and design, the easier it is going to be to get the initial
btrfs fscrypt support merged.  Planning ahead for new features is good, but we
should also be careful not to overdesign.

- Eric
Sweet Tea Dorminy July 3, 2023, 8:37 p.m. UTC | #4
On 7/3/23 14:17, Eric Biggers wrote:
> On Mon, Jul 03, 2023 at 01:06:17PM -0400, Sweet Tea Dorminy wrote:
>>>
>>> I think that would avoid many of the problems that it seems you've had to work
>>> around or have had to change user-visible semantics for.  For example the
>>> problems involving master keys being added and removed.  It would also avoid
>>> having to overload 'fscrypt_info' to be either a per-inode or a per-extent key.
>>> And it would save space on disk and in memory.
>>
>> I might be misunderstanding what you're referencing, but I think you're
>> talking about the change where with extent fscrypt, IO has to be forced down
>> before removing a key, otherwise it is lost. I think that's a fundamental
>> problem given the filesystem has no way to know that there are new, dirty
>> pages in the pagecache until those pages are issued for write, so it can't
>> create a new extent or few until that point, potentially after the relevant
>> key has been evicted. But maybe I'm missing a hook that would let us make
>> extents earlier.
>>
>> I suppose we could give each leaf inode a proper nonce/prepared key instead
>> of borrowing its parent dir's: if a write came in after the key was removed
>> but the inode is still open, the new extent(s) could grab the key material
>> out of the inode's info. I don't like this very much since it could result
>> in multiple extents grabbing the same key material, but I suppose it could
>> work if it's important to maintain that behavior.
> 
> Right, if extent keys are derived directly from the master key, and
> FS_IOC_REMOVE_ENCRYPTION_KEY is executed which causes the master key secret to
> be wiped, then no more extent keys can be derived.
> 
> So I see why you implemented the behavior you did.  It does seem a bit
> dangerous, though.  If I understand correctly, under your proposal, if
> FS_IOC_REMOVE_ENCRYPTION_KEY is executed while a process is doing buffered
> writes to an encrypted file protected by that key, then past writes will get
> synced out just before the key removal, but future writes will just be silently
> thrown away.  (Remember, writes in Linux are asynchronous; processes don't get
> informed of write errors unless they call fsync() or are doing direct I/O.)
> I wonder if we should just keep the master key around, for per-extent key
> derivation only, until all files using that master key have been closed.  That
> would minimize the changes from the current fscrypt semantics.

That would work, if you're good with that. It felt like a larger 
deviation, with the possibility of unbounded continued use of a 
soft-removed master key; should fscrypt restrict use of the master key 
only to extents that are part of inodes that are already open?

> 
>>> Can you elaborate on why you went with a more "heavyweight" extents design?
>> Being able to rekey a directory is the reason for having full contexts:
>> suppose I take a snapshot of an encrypted dir and want to change the key for
>> new data going forward, to avoid using a single key on too much data. It's
>> too expensive to reencrypt every extent with the new key, since the whole
>> point of a snapshot is to make a lightweight copy that gets COWed on write.
>> Then each extent needs to know what its own master key
>> identifier/policy/flags are, since different extents in the same file could
>> have different master keys. We could say the mode and flags have to match,
>> but it doesn't seem to me worth saving seven bytes to add a new structure to
>> just store the master key identifier and nonce.
>>
>> For a non-Meta usecase, from what I've heard from Fedora-land, it's possibly
>> interesting to them to be able to ship an encrypted image, and then be able
>> to change the key after encrypted install to something user-controlled.
>>
>> Without rekeying, my understanding is that we may write too much data with
>> one key for safety; notes in the updated design doc https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing
>> are that writing more than 1P per key raises cryptographic concerns, and
>> since btrfs is COW and could have volumes up to the full 16E size that btrfs
>> supports, we don't want to have just one immutable key per subvol.
>>
>> To me the lightweight-on-disk vision sounds a lot like the original design:
>> https://lore.kernel.org/linux-btrfs/YXGyq+buM79A1S0L@relinquished.localdomain
>> and the Nov '22 version of the patchset: https://lore.kernel.org/linux-btrfs/cover.1667389115.git.sweettea-kernel@dorminy.me/
>> (which didn't have rekeying). I think rekeying is worth the higher disk
>> usage; but I'm probably missing something about how your vision differs from
>> the original. Could you please take a look at it again?
> 
> Your original design didn't use per-extent keys, but rather had a single
> contents encryption key per master key.  We had discussed that that approach had
> multiple disadvantages, one of which is that on btrfs it can run uncomfortably
> close to the cryptographic limits for the contents encryption modes such as
> AES-XTS.  So we decided to go with per-extent keys instead.

> I don't think we discussed cryptographic limits on the master key itself.  That
> is actually much less of a concern, as the master key is just used for
> HKDF-SHA512.  I don't think there is any real need to ever change the master
> key.  (Well, if it is compromised, it could be needed, but that's not really
> relevant here.  If that happens you'd need to re-encrypt everything anyway.)
So you're proposing just storing a nonce per extent, then setting up a 
prepared key only, taking the inode's master key and doing the hkdf 
expand with the nonce into the prepared key? A terse approach, a lot 
closer to the original design than I thought.
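[Editor's illustration] That derivation can be sketched standalone. This models HKDF-SHA512 as fscrypt uses it (unsalted extract, info prefixed with "fscrypt\0" and a context byte, per fs/crypto/hkdf.c); reusing the per-file context byte for extents is an assumption here, not necessarily what the patchset does:

```python
import hashlib
import hmac

def hkdf_expand_sha512(prk: bytes, info: bytes, length: int) -> bytes:
    """RFC 5869 HKDF-Expand with SHA-512."""
    okm, t, counter = b"", b"", 1
    while len(okm) < length:
        t = hmac.new(prk, t + info + bytes([counter]), hashlib.sha512).digest()
        okm += t
        counter += 1
    return okm[:length]

def derive_extent_key(master_key: bytes, nonce: bytes, length: int = 64) -> bytes:
    # Unsalted extract step: PRK = HMAC-SHA512(zero salt, master_key).
    prk = hmac.new(b"\x00" * 64, master_key, hashlib.sha512).digest()
    # fscrypt prefixes the HKDF info with "fscrypt\0" plus a context
    # byte; 2 is the per-file-key context, reused here for extents.
    info = b"fscrypt\x00" + bytes([2]) + nonce
    return hkdf_expand_sha512(prk, info, length)

master_key = bytes(range(64))  # demo value, not a real secret
key_a = derive_extent_key(master_key, bytes(16))      # nonce stored on disk
key_b = derive_extent_key(master_key, b"\x01" * 16)   # a different extent
print(len(key_a), key_a != key_b)  # 64 True
```

Each extent would then only need to persist its 16-byte nonce; the prepared key is recomputed from the inode's master key on demand.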

> I do recall some discussion of making it possible to set an encryption policy on
> an *unencrypted* directory, causing new files in that directory to be encrypted.
> However, I don't recall any discussion of making it possible to add another
> encryption policy to an *already-encrypted* directory.  I think this is the
> first time this has been brought up.

I think I referenced it in the updated design (fifth paragraph of 
"Extent encryption" 
https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing) 
but I didn't talk about it enough -- 'rekeying' is a substitute for 
adding a policy to a directory full of unencrypted data. Y'all's points 
about the badness of having mixed unencrypted and encrypted data in a 
single dir were compelling. (As I recall it, the issue with having mixed 
enc/unenc data is that a bug or attacker could point an encrypted file 
autostarted in a container, say /container/my-service, at an unencrypted 
extent under their control, say /bin/bash, and thereby acquire a backdoor.)
> 
> I think that allowing directories to have multiple encryption policies would
> bring in a lot of complexity.  How would it be configured, and what would the
> semantics for accessing it be?  Where would the encryption policies be stored?
> What if you have added some of the keys but not all of them?  What if some of
> the keys get removed but not all of them?
I'd planned to use add_enckey to add all the necessary keys; then 
set_encpolicy on an encrypted directory, under the proper conditions 
(flags interpreted by the ioctl? check whether the filesystem has a 
hook?), recursively calls a filesystem-provided hook on each inode 
within to change the fscrypt_context. Either dir items could be 
reencrypted with the new key, or dirs could keep around the oldest key 
for encryption and the newest key for new leaf inodes to inherit.

For leaf inodes, 2 options for the remaining questions:

1) leaf inodes just store the policy to be used for writing new data or 
inherits it from a parent directory. If you're reading and you hit an 
extent whose key isn't loaded, you get an IO error. If you're writing 
and you try to write to an inode who needs a key that isn't loaded, you 
get an IO error eventually.

This is what I'd prefer: Josef and Chris and I discussed this a while 
back and thought this was acceptable semantics for our uses, I think, 
and it's simple; you can already get an IO error if you're reading a 
file and suddenly hit an invalid extent, or if you are writing a file 
and run out of space.

2) all inodes store a fscrypt_context for every policy configured for 
the inode. ->get_context() gets an index parameter to fetch the index'th 
context. Directory inodes' info is loaded from the context needed for dir 
entry encryption, leaf inodes' from the newest context. When any inode is 
opened, all the contexts are read and checked that their key is loaded, 
and then discarded, except the one in use for writes to that inode.

If a key is removed, you get an IO error when you do IO that needs that 
key, with the usual async conditions. Or, some lightweight structure is 
added to each master key's open inode list to point at the inode using 
it, and the inode keeps the master key around while it's open, thereby 
preventing IO errors after open.
> 
> Can you elaborate more on why you want this?  I was already a bit concerned
> about the plan for making it possible to set an encryption policy on an
> unencrypted directory, as that already diverges from the existing fscrypt
> semantics.  But now it sounds like the scope has grown even more.
> 
> Keep in mind that in general, the closer we are able to stick to the existing
> fscrypt semantics and design, the easier it is going to be to get the initial
> btrfs fscrypt support merged.  Planning ahead for new features is good, but we
> should also be careful not to overdesign.

Here's a shot at elaborating the usecase more:

On various machines, we currently have a btrfs filesystem containing 
various volumes/snapshots containing starting states for containers. The 
snapshots are generated by common snapshot images built centrally. The 
machines, as the scheduler requests, start and stop containers on those 
volumes.

We want to be able to start a container on a snapshot/volume such that 
every write to the relevant snapshot/volume is using a per-container 
key, but we don't care about reads of the starting snapshot image being 
encrypted since the starting snapshot image is widely shared. When the 
container is stopped, no future different container (or human or host 
program) knows its key. This limits the data which could be lost to a 
malicious service/human on the host to only the volumes containing 
currently running containers.

Some other folks envision having a base image encrypted with some 
per-vendor key. Then the machine is rekeyed with a per-machine key in 
perhaps the TPM to use for updates and logfiles. When a user is created, 
a snapshot of a base homedir forms the base of their user 
subvolume/directory, which is then rekeyed with a per-user key. When the 
user logs in, systemd-homedir or the like could load their per-user key 
for their user subvolume/directory.

Since we don't care about encrypting the common image, we initially 
envisioned unencrypted snapshot images where we then turn on encryption 
and have mixed unenc/enc data. The other usecase, though, really needs 
key change so that everything's encrypted. And the argument that mixed 
unenc/enc data is not safe was compelling.

Hope that helps?

Sweet Tea
Eric Biggers July 4, 2023, 12:28 a.m. UTC | #5
On Mon, Jul 03, 2023 at 04:37:52PM -0400, Sweet Tea Dorminy wrote:
> 
> On 7/3/23 14:17, Eric Biggers wrote:
> > On Mon, Jul 03, 2023 at 01:06:17PM -0400, Sweet Tea Dorminy wrote:
> > > > 
> > > > I think that would avoid many of the problems that it seems you've had to work
> > > > around or have had to change user-visible semantics for.  For example the
> > > > problems involving master keys being added and removed.  It would also avoid
> > > > having to overload 'fscrypt_info' to be either a per-inode or a per-extent key.
> > > > And it would save space on disk and in memory.
> > > 
> > > I might be misunderstanding what you're referencing, but I think you're
> > > talking about the change where with extent fscrypt, IO has to be forced down
> > > before removing a key, otherwise it is lost. I think that's a fundamental
> > > problem given the filesystem has no way to know that there are new, dirty
> > > pages in the pagecache until those pages are issued for write, so it can't
> > > create a new extent or few until that point, potentially after the relevant
> > > key has been evicted. But maybe I'm missing a hook that would let us make
> > > extents earlier.
> > > 
> > > I suppose we could give each leaf inode a proper nonce/prepared key instead
> > > of borrowing its parent dir's: if a write came in after the key was removed
> > > but the inode is still open, the new extent(s) could grab the key material
> > > out of the inode's info. I don't like this very much since it could result
> > > in multiple extents grabbing the same key material, but I suppose it could
> > > work if it's important to maintain that behavior.
> > 
> > Right, if extent keys are derived directly from the master key, and
> > FS_IOC_REMOVE_ENCRYPTION_KEY is executed which causes the master key secret to
> > be wiped, then no more extent keys can be derived.
> > 
> > So I see why you implemented the behavior you did.  It does seem a bit
> > dangerous, though.  If I understand correctly, under your proposal, if
> > FS_IOC_REMOVE_ENCRYPTION_KEY is executed while a process is doing buffered
> > writes to an encrypted file protected by that key, then past writes will get
> > synced out just before the key removal, but future writes will just be silently
> > thrown away.  (Remember, writes in Linux are asynchronous; processes don't get
> > informed of write errors unless they call fsync() or are doing direct I/O.)
> > I wonder if we should just keep the master key around, for per-extent key
> > derivation only, until all files using that master key have been closed.  That
> > would minimize the changes from the current fscrypt semantics.
> 
> That would work, if you're good with that. It felt like a larger deviation,
> with the possibility of unbounded continued use of a soft-removed master
> key;

Yes, it's not ideal, but essentially that same behavior already exists for the
encryption settings that don't use per-file keys.

The fact is that key removal with in-use files is always going to be
"incomplete".  People who really need the key removal to work properly need to
ensure that all files are closed first, or at least ensure that they're closed
eventually and keep retrying the key removal until then...

> should fscrypt restrict use of the master key only to extents that are
> part of inodes that are already open?

Yes, I think so.  That way the user-visible semantics would stay the same as is
currently the case for fscrypt.

> > > > Can you elaborate on why you went with a more "heavyweight" extents design?
> > > Being able to rekey a directory is the reason for having full contexts:
> > > suppose I take a snapshot of an encrypted dir and want to change the key for
> > > new data going forward, to avoid using a single key on too much data. It's
> > > too expensive to reencrypt every extent with the new key, since the whole
> > > point of a snapshot is to make a lightweight copy that gets COWed on write.
> > > Then each extent needs to know what its own master key
> > > identifier/policy/flags are, since different extents in the same file could
> > > have different master keys. We could say the mode and flags have to match,
> > > but it doesn't seem to me worth saving seven bytes to add a new structure to
> > > just store the master key identifier and nonce.
> > > 
> > > For a non-Meta usecase, from what I've heard from Fedora-land, it's possibly
> > > interesting to them to be able to ship an encrypted image, and then be able
> > > to change the key after encrypted install to something user-controlled.
> > > 
> > > Without rekeying, my understanding is that we may write too much data with
> > > one key for safety; notes in the updated design doc https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing
> > > are that writing more than 1P per key raises cryptographic concerns, and
> > > since btrfs is COW and could have volumes up to the full 16E size that btrfs
> > > supports, we don't want to have just one immutable key per subvol.
> > > 
> > > To me the lightweight-on-disk vision sounds a lot like the original design:
> > > https://lore.kernel.org/linux-btrfs/YXGyq+buM79A1S0L@relinquished.localdomain
> > > and the Nov '22 version of the patchset: https://lore.kernel.org/linux-btrfs/cover.1667389115.git.sweettea-kernel@dorminy.me/
> > > (which didn't have rekeying). I think rekeying is worth the higher disk
> > > usage; but I'm probably missing something about how your vision differs from
> > > the original. Could you please take a look at it again?
> > 
> > Your original design didn't use per-extent keys, but rather had a single
> > contents encryption key per master key.  We had discussed that that approach had
> > multiple disadvantages, one of which is that on btrfs it can run uncomfortably
> > close to the cryptographic limits for the contents encryption modes such as
> > AES-XTS.  So we decided to go with per-extent keys instead.
> 
> > I don't think we discussed cryptographic limits on the master key itself.  That
> > is actually much less of a concern, as the master key is just used for
> > HKDF-SHA512.  I don't think there is any real need to ever change the master
> > key.  (Well, if it is compromised, it could be needed, but that's not really
> > relevant here.  If that happens you'd need to re-encrypt everything anyway.)
> So you're proposing just storing a nonce per extent, then setting up a
> prepared key only, taking the inode's master key and doing the hkdf expand
> with the nonce into the prepared key? A terse approach, a lot closer to the
> original design than I thought.

Yes.
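
For illustration, a minimal sketch of that derivation, assuming 
HKDF-SHA512 as fscrypt's KDF; the context byte and key sizes below are 
placeholders, not fscrypt's real on-disk values:

```python
import hashlib
import hmac

def hkdf_expand_sha512(prk: bytes, info: bytes, length: int) -> bytes:
    """RFC 5869 HKDF-Expand with SHA-512."""
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha512).digest()
        okm += block
        counter += 1
    return okm[:length]

# Hypothetical context tag for per-extent keys; fscrypt's real HKDF
# context values differ.
EXTENT_KEY_CONTEXT = b"\x20"

def derive_extent_key(master_key: bytes, extent_nonce: bytes) -> bytes:
    # Each extent persists only its nonce; the contents key is re-derived
    # from the master key whenever the extent is read or written.
    return hkdf_expand_sha512(master_key, EXTENT_KEY_CONTEXT + extent_nonce, 64)

master = bytes(64)                    # toy all-zero master key, demo only
key_a = derive_extent_key(master, b"A" * 16)
key_b = derive_extent_key(master, b"B" * 16)
```

Distinct nonces yield independent contents keys, which is what keeps the 
amount of data encrypted under any single XTS key bounded.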

> > I do recall some discussion of making it possible to set an encryption policy on
> > an *unencrypted* directory, causing new files in that directory to be encrypted.
> > However, I don't recall any discussion of making it possible to add another
> > encryption policy to an *already-encrypted* directory.  I think this is the
> > first time this has been brought up.
> 
> I think I referenced it in the updated design (fifth paragraph of "Extent
> encryption" https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing)
> but I didn't talk about it enough -- 'rekeying' is a substitute for adding a
> policy to a directory full of unencrypted data. Ya'll's points about the
> badness of having mixed unencrypted and encrypted data in a single dir were
> compelling. (As I recall it, the issue with having mixed enc/unenc data is
> that a bug or attacker could point an encrypted file autostarted in a
> container, say /container/my-service, at a unencrypted extent under their
> control, say /bin/bash, and thereby acquire a backdoor.)
> > 
> > I think that allowing directories to have multiple encryption policies would
> > bring in a lot of complexity.  How would it be configured, and what would the
> > semantics for accessing it be?  Where would the encryption policies be stored?
> > What if you have added some of the keys but not all of them?  What if some of
> > the keys get removed but not all of them?
> I'd planned to use add_enckey to add all the necessary keys, set_encpolicy
> on an encrypted directory under the proper conditions (flags interpreted by
> ioctl? check if filesystem has hook?) recursively calls a
> filesystem-provided hook on each inode within to change the fscrypt_context.

That sounds quite complex.  Recursive directory operations aren't really
something the kernel does.  It would also require updating every inode, causing
COW of every inode.  Isn't that something you'd really like to avoid, to make
starting a new container as fast and lightweight as possible?

> Either dir items could be reencrypted with the new key or dirs could keep
> around the oldest key for encryption and the newest key for new leaf inodes
> to inherit.
> 
> For leaf inodes, 2 options for the remaining questions:
> 
> 1) leaf inodes just stores the policy to be used for writing new data or
> inherits it from a parent directory. If you're reading and you hit an extent
> whose key isn't loaded, you get an IO error. If you're writing and you try
> to write to an inode who needs a key that isn't loaded, you get an IO error
> eventually.
> 
> This is what I'd prefer: Josef and Chris and I discussed this a while back
> and thought this was acceptable semantics for our uses, I think, and it's
> simple; you can already get an IO error if you're reading a file and
> suddenly hit an invalid extent, or if you are writing a file and run out of
> space.
> 
> 2) all inodes store a fscrypt_context for all policies configured for the
> inode. ->get_context() gets a index parameter to get the index'th context.
> directory inodes' info is loaded off the one needed for dir entry
> encryption, leaf inodes' off the newest context. When any inode is opened,
> all the contexts are read and checked that their key is loaded, and then
> discarded, except the one in use for writes to that inode.
> 
> If a key is removed, you get an IO error when you do IO that needs that key,
> with the usual async conditions. Or, some lightweight structure is added to
> each master key's open inode list to point at the inode using it, and the
> inode keeps the master key around while it's open, thereby preventing IO
> errors after open.
> > 
> > Can you elaborate more on why you want this?  I was already a bit concerned
> > about the plan for making it possible to set an encryption policy on an
> > unencrypted directory, as that already diverges from the existing fscrypt
> > semantics.  But now it sounds like the scope has grown even more.
> > 
> > Keep in mind that in general, the closer we are able to stick to the existing
> > fscrypt semantics and design, the easier it is going to be to get the initial
> > btrfs fscrypt support merged.  Planning ahead for new features is good, but we
> > should also be careful not to overdesign.
> 
> Here's a shot at elaboration of usecase more:
> 
> On various machines, we currently have a btrfs filesystem containing various
> volumes/snapshots containing starting states for containers. The snapshots
> are generated by common snapshot images built centrally. The machines, as
> the scheduler requests, start and stop containers on those volumes.
> 
> We want to be able to start a container on a snapshot/volume such that every
> write to the relevant snapshot/volume is using a per-container key, but we
> don't care about reads of the starting snapshot image being encrypted since
> the starting snapshot image is widely shared. When the container is stopped,
> no future different container (or human or host program) knows its key. This
> limits the data which could be lost to a malicious service/human on the host
> to only the volumes containing currently running containers.
> 
> Some other folks envision having a base image encrypted with some per-vendor
> key. Then the machine is rekeyed with a per-machine key in perhaps the TPM
> to use for updates and logfiles. When a user is created, a snapshot of a
> base homedir forms the base of their user subvolume/directory, which is then
> rekeyed with a per-user key. When the user logs in, systemd-homedir or the
> like could load their per-user key for their user subvolume/directory.
> 
> Since we don't care about encrypting the common image, we initially
> envisioned unencrypted snapshot images where we then turn on encryption and
> have mixed unenc/enc data. The other usecase, though, really needs key
> change so that everything's encrypted. And the argument that mixed unenc/enc
> data is not safe was compelling.
> 
> Hope that helps?

Maybe a dumb question: why aren't you just using overlayfs?  It's already
possible to use overlayfs with an fscrypt-encrypted upperdir and workdir.  When
creating a new container you can create a new directory and assign it an fscrypt
policy (with a per-container or per-user key or whatever that container wants),
and create two subdirectories 'upperdir' and 'workdir' in it.  Then just mount
an overlayfs with that upperdir and workdir, and lowerdir referring to the
starting rootfs.  Then use that overlayfs as the rootfs as the container.

Wouldn't that solve your use case exactly?  Is there a reason you really want to
create the container directly from a btrfs snapshot instead?

- Eric
Sweet Tea Dorminy July 4, 2023, 1:57 a.m. UTC | #6
>>> I do recall some discussion of making it possible to set an encryption policy on
>>> an *unencrypted* directory, causing new files in that directory to be encrypted.
>>> However, I don't recall any discussion of making it possible to add another
>>> encryption policy to an *already-encrypted* directory.  I think this is the
>>> first time this has been brought up.
>>
>> I think I referenced it in the updated design (fifth paragraph of "Extent
>> encryption" https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing)
>> but I didn't talk about it enough -- 'rekeying' is a substitute for adding a
>> policy to a directory full of unencrypted data. Ya'll's points about the
>> badness of having mixed unencrypted and encrypted data in a single dir were
>> compelling. (As I recall it, the issue with having mixed enc/unenc data is
>> that a bug or attacker could point an encrypted file autostarted in a
>> container, say /container/my-service, at a unencrypted extent under their
>> control, say /bin/bash, and thereby acquire a backdoor.)
>>>
>>> I think that allowing directories to have multiple encryption policies would
>>> bring in a lot of complexity.  How would it be configured, and what would the
>>> semantics for accessing it be?  Where would the encryption policies be stored?
>>> What if you have added some of the keys but not all of them?  What if some of
>>> the keys get removed but not all of them?
>> I'd planned to use add_enckey to add all the necessary keys, set_encpolicy
>> on an encrypted directory under the proper conditions (flags interpreted by
>> ioctl? check if filesystem has hook?) recursively calls a
>> filesystem-provided hook on each inode within to change the fscrypt_context.
> 
> That sounds quite complex.  Recursive directory operations aren't really
> something the kernel does.  It would also require updating every inode, causing
> COW of every inode.  Isn't that something you'd really like to avoid, to make
> starting a new container as fast and lightweight as possible?

A fair point. We can move the penalty to open or write time instead, 
though: btrfs could store a generation number with the new context on 
only the changed directory; then leaf inodes or new extents can traverse 
up the directory tree and inherit the context from the 
highest-generation-number directory in their path. Or btrfs could 
disallow changing anything except the base of a subvolume, and just go 
directly to the top of the subvolume to grab the appropriate context. 
Neither of those requires recursion outside btrfs.
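
A rough sketch of the first variant (illustrative Python; the structures 
and names are hypothetical, not btrfs code):

```python
from typing import Optional

class Dir:
    def __init__(self, parent: Optional["Dir"], generation: int, context: str):
        self.parent = parent
        self.generation = generation  # bumped when this dir is rekeyed
        self.context = context        # fscrypt context written at rekey time

def effective_context(start: Dir) -> str:
    """Walk toward the subvolume root and use the context from the
    highest-generation ancestor, i.e. the most recent rekey on the path."""
    best, node = start, start.parent
    while node is not None:
        if node.generation > best.generation:
            best = node
        node = node.parent
    return best.context

root = Dir(None, generation=1, context="vendor-key")
home = Dir(root, generation=5, context="user-key")    # rekeyed after snapshot
proj = Dir(home, generation=1, context="vendor-key")  # untouched by the rekey
```

Only the rekeyed directory is rewritten; descendants pick up the new 
context lazily at open/write time.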

>> On various machines, we currently have a btrfs filesystem containing various
>> volumes/snapshots containing starting states for containers. The snapshots
>> are generated by common snapshot images built centrally. The machines, as
>> the scheduler requests, start and stop containers on those volumes.
>>
>> We want to be able to start a container on a snapshot/volume such that every
>> write to the relevant snapshot/volume is using a per-container key, but we
>> don't care about reads of the starting snapshot image being encrypted since
>> the starting snapshot image is widely shared. When the container is stopped,
>> no future different container (or human or host program) knows its key. This
>> limits the data which could be lost to a malicious service/human on the host
>> to only the volumes containing currently running containers.
>>
>> Some other folks envision having a base image encrypted with some per-vendor
>> key. Then the machine is rekeyed with a per-machine key in perhaps the TPM
>> to use for updates and logfiles. When a user is created, a snapshot of a
>> base homedir forms the base of their user subvolume/directory, which is then
>> rekeyed with a per-user key. When the user logs in, systemd-homedir or the
>> like could load their per-user key for their user subvolume/directory.
>>
>> Since we don't care about encrypting the common image, we initially
>> envisioned unencrypted snapshot images where we then turn on encryption and
>> have mixed unenc/enc data. The other usecase, though, really needs key
>> change so that everything's encrypted. And the argument that mixed unenc/enc
>> data is not safe was compelling.
>>
>> Hope that helps?
> 
> Maybe a dumb question: why aren't you just using overlayfs?  It's already
> possible to use overlayfs with an fscrypt-encrypted upperdir and workdir.  When
> creating a new container you can create a new directory and assign it an fscrypt
> policy (with a per-container or per-user key or whatever that container wants),
> and create two subdirectories 'upperdir' and 'workdir' in it.  Then just mount
> an overlayfs with that upperdir and workdir, and lowerdir referring to the
> starting rootfs.  Then use that overlayfs as the rootfs as the container.
> 
> Wouldn't that solve your use case exactly?  Is there a reason you really want to
> create the container directly from a btrfs snapshot instead?

Hardly a dumb question; quite an intriguing idea, actually. Let me think 
about this with folks when we get back to work on Wednesday. I'm not sure 
how it fits the other usecase, the base image/per-machine/per-user combo, 
but I'll think about it.
Neal Gompa July 5, 2023, 12:13 p.m. UTC | #7
On Mon, Jul 3, 2023 at 10:03 PM Sweet Tea Dorminy
<sweettea-kernel@dorminy.me> wrote:
>
>
> >>> I do recall some discussion of making it possible to set an encryption policy on
> >>> an *unencrypted* directory, causing new files in that directory to be encrypted.
> >>> However, I don't recall any discussion of making it possible to add another
> >>> encryption policy to an *already-encrypted* directory.  I think this is the
> >>> first time this has been brought up.
> >>
> >> I think I referenced it in the updated design (fifth paragraph of "Extent
> >> encryption" https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing)
> >> but I didn't talk about it enough -- 'rekeying' is a substitute for adding a
> >> policy to a directory full of unencrypted data. Ya'll's points about the
> >> badness of having mixed unencrypted and encrypted data in a single dir were
> >> compelling. (As I recall it, the issue with having mixed enc/unenc data is
> >> that a bug or attacker could point an encrypted file autostarted in a
> >> container, say /container/my-service, at a unencrypted extent under their
> >> control, say /bin/bash, and thereby acquire a backdoor.)
> >>>
> >>> I think that allowing directories to have multiple encryption policies would
> >>> bring in a lot of complexity.  How would it be configured, and what would the
> >>> semantics for accessing it be?  Where would the encryption policies be stored?
> >>> What if you have added some of the keys but not all of them?  What if some of
> >>> the keys get removed but not all of them?
> >> I'd planned to use add_enckey to add all the necessary keys, set_encpolicy
> >> on an encrypted directory under the proper conditions (flags interpreted by
> >> ioctl? check if filesystem has hook?) recursively calls a
> >> filesystem-provided hook on each inode within to change the fscrypt_context.
> >
> > That sounds quite complex.  Recursive directory operations aren't really
> > something the kernel does.  It would also require updating every inode, causing
> > COW of every inode.  Isn't that something you'd really like to avoid, to make
> > starting a new container as fast and lightweight as possible?
>
> A fair point. Can move the penalty to open or write time instead though:
> btrfs could store a generation number with the new context on only the
> directory changed, then leaf inodes or new extent can traverse up the
> directory tree and grab context from the highest-generation-number
> directory in its path to inherit from. Or btrfs could disallow changing
> except on the base of a subvolume, and just go directly to the top of
> the subvolume to grab the appropriate context. Neither of those require
> recursion outside btrfs.
>
> >> On various machines, we currently have a btrfs filesystem containing various
> >> volumes/snapshots containing starting states for containers. The snapshots
> >> are generated by common snapshot images built centrally. The machines, as
> >> the scheduler requests, start and stop containers on those volumes.
> >>
> >> We want to be able to start a container on a snapshot/volume such that every
> >> write to the relevant snapshot/volume is using a per-container key, but we
> >> don't care about reads of the starting snapshot image being encrypted since
> >> the starting snapshot image is widely shared. When the container is stopped,
> >> no future different container (or human or host program) knows its key. This
> >> limits the data which could be lost to a malicious service/human on the host
> >> to only the volumes containing currently running containers.
> >>
> >> Some other folks envision having a base image encrypted with some per-vendor
> >> key. Then the machine is rekeyed with a per-machine key in perhaps the TPM
> >> to use for updates and logfiles. When a user is created, a snapshot of a
> >> base homedir forms the base of their user subvolume/directory, which is then
> >> rekeyed with a per-user key. When the user logs in, systemd-homedir or the
> >> like could load their per-user key for their user subvolume/directory.
> >>
> >> Since we don't care about encrypting the common image, we initially
> >> envisioned unencrypted snapshot images where we then turn on encryption and
> >> have mixed unenc/enc data. The other usecase, though, really needs key
> >> change so that everything's encrypted. And the argument that mixed unenc/enc
> >> data is not safe was compelling.
> >>
> >> Hope that helps?
> >
> > Maybe a dumb question: why aren't you just using overlayfs?  It's already
> > possible to use overlayfs with an fscrypt-encrypted upperdir and workdir.  When
> > creating a new container you can create a new directory and assign it an fscrypt
> > policy (with a per-container or per-user key or whatever that container wants),
> > and create two subdirectories 'upperdir' and 'workdir' in it.  Then just mount
> > an overlayfs with that upperdir and workdir, and lowerdir referring to the
> > starting rootfs.  Then use that overlayfs as the rootfs as the container.
> >
> > Wouldn't that solve your use case exactly?  Is there a reason you really want to
> > create the container directly from a btrfs snapshot instead?
>
> Hardly; a quite intriguing idea. Let me think about this with folks when
> we get back to work on Wednesday. Not sure how it goes with the other
> usecase, the base image/per-machine/per-user combo, but will think about it.

I like creating containers directly based on my host system for
development and destructive purposes. It saves space and is incredibly
useful.

But the layered key encryption thing is also core to the encryption
strategy we want to take in Fedora, so I would really like to see this
be possible with Btrfs encryption.

Critically, it means that unlocking a user subvolume will always be
multi-factor: something you have (machine key) and something you know
(user credentials).
Eric Biggers July 5, 2023, 4:28 p.m. UTC | #8
On Wed, Jul 05, 2023 at 08:13:34AM -0400, Neal Gompa wrote:
> On Mon, Jul 3, 2023 at 10:03 PM Sweet Tea Dorminy
> <sweettea-kernel@dorminy.me> wrote:
> >
> >
> > >>> I do recall some discussion of making it possible to set an encryption policy on
> > >>> an *unencrypted* directory, causing new files in that directory to be encrypted.
> > >>> However, I don't recall any discussion of making it possible to add another
> > >>> encryption policy to an *already-encrypted* directory.  I think this is the
> > >>> first time this has been brought up.
> > >>
> > >> I think I referenced it in the updated design (fifth paragraph of "Extent
> > >> encryption" https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing)
> > >> but I didn't talk about it enough -- 'rekeying' is a substitute for adding a
> > >> policy to a directory full of unencrypted data. Ya'll's points about the
> > >> badness of having mixed unencrypted and encrypted data in a single dir were
> > >> compelling. (As I recall it, the issue with having mixed enc/unenc data is
> > >> that a bug or attacker could point an encrypted file autostarted in a
> > >> container, say /container/my-service, at a unencrypted extent under their
> > >> control, say /bin/bash, and thereby acquire a backdoor.)
> > >>>
> > >>> I think that allowing directories to have multiple encryption policies would
> > >>> bring in a lot of complexity.  How would it be configured, and what would the
> > >>> semantics for accessing it be?  Where would the encryption policies be stored?
> > >>> What if you have added some of the keys but not all of them?  What if some of
> > >>> the keys get removed but not all of them?
> > >> I'd planned to use add_enckey to add all the necessary keys, set_encpolicy
> > >> on an encrypted directory under the proper conditions (flags interpreted by
> > >> ioctl? check if filesystem has hook?) recursively calls a
> > >> filesystem-provided hook on each inode within to change the fscrypt_context.
> > >
> > > That sounds quite complex.  Recursive directory operations aren't really
> > > something the kernel does.  It would also require updating every inode, causing
> > > COW of every inode.  Isn't that something you'd really like to avoid, to make
> > > starting a new container as fast and lightweight as possible?
> >
> > A fair point. Can move the penalty to open or write time instead though:
> > btrfs could store a generation number with the new context on only the
> > directory changed, then leaf inodes or new extent can traverse up the
> > directory tree and grab context from the highest-generation-number
> > directory in its path to inherit from. Or btrfs could disallow changing
> > except on the base of a subvolume, and just go directly to the top of
> > the subvolume to grab the appropriate context. Neither of those require
> > recursion outside btrfs.
> >
> > >> On various machines, we currently have a btrfs filesystem containing various
> > >> volumes/snapshots containing starting states for containers. The snapshots
> > >> are generated by common snapshot images built centrally. The machines, as
> > >> the scheduler requests, start and stop containers on those volumes.
> > >>
> > >> We want to be able to start a container on a snapshot/volume such that every
> > >> write to the relevant snapshot/volume is using a per-container key, but we
> > >> don't care about reads of the starting snapshot image being encrypted since
> > >> the starting snapshot image is widely shared. When the container is stopped,
> > >> no future different container (or human or host program) knows its key. This
> > >> limits the data which could be lost to a malicious service/human on the host
> > >> to only the volumes containing currently running containers.
> > >>
> > >> Some other folks envision having a base image encrypted with some per-vendor
> > >> key. Then the machine is rekeyed with a per-machine key in perhaps the TPM
> > >> to use for updates and logfiles. When a user is created, a snapshot of a
> > >> base homedir forms the base of their user subvolume/directory, which is then
> > >> rekeyed with a per-user key. When the user logs in, systemd-homedir or the
> > >> like could load their per-user key for their user subvolume/directory.
> > >>
> > >> Since we don't care about encrypting the common image, we initially
> > >> envisioned unencrypted snapshot images where we then turn on encryption and
> > >> have mixed unenc/enc data. The other usecase, though, really needs key
> > >> change so that everything's encrypted. And the argument that mixed unenc/enc
> > >> data is not safe was compelling.
> > >>
> > >> Hope that helps?
> > >
> > > Maybe a dumb question: why aren't you just using overlayfs?  It's already
> > > possible to use overlayfs with an fscrypt-encrypted upperdir and workdir.  When
> > > creating a new container you can create a new directory and assign it an fscrypt
> > > policy (with a per-container or per-user key or whatever that container wants),
> > > and create two subdirectories 'upperdir' and 'workdir' in it.  Then just mount
> > > an overlayfs with that upperdir and workdir, and lowerdir referring to the
> > > starting rootfs.  Then use that overlayfs as the rootfs as the container.
> > >
> > > Wouldn't that solve your use case exactly?  Is there a reason you really want to
> > > create the container directly from a btrfs snapshot instead?
> >
> > Hardly; a quite intriguing idea. Let me think about this with folks when
> > we get back to work on Wednesday. Not sure how it goes with the other
> > usecase, the base image/per-machine/per-user combo, but will think about it.
> 
> I like creating containers directly based on my host system for
> development and destructive purposes. It saves space and is incredibly
> useful.

A solution for that already exists.  It's called btrfs snapshots.  Which you
probably already know, since it's probably what you're using :-)

Using overlayfs would simply mean that each container consists of an upper and
lower directory instead of a single directory.  Either or both could still be
btrfs subvolumes.  They could even be on the same subvolume.
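
As a sketch, the setup described above might look like the following; all
paths, key files, and the exact fscrypt tool invocation are illustrative
assumptions, not taken from this thread:

```shell
# Hypothetical layout for one container; flags may vary by setup.
mkdir -p /containers/c1
# Give the directory its own fscrypt policy with a per-container key
# (using the fscrypt userspace tool):
fscrypt encrypt /containers/c1 --source=raw_key \
    --key=/etc/keys/c1.key --name=container-c1
mkdir /containers/c1/upper /containers/c1/work /containers/c1/merged
# Overlay the shared (possibly unencrypted) base rootfs with the
# encrypted upperdir; the merged dir becomes the container's rootfs:
mount -t overlay overlay \
    -o lowerdir=/images/base-rootfs,upperdir=/containers/c1/upper,workdir=/containers/c1/work \
    /containers/c1/merged
```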

> 
> But the layered key encryption thing is also core to the encryption
> strategy we want to take in Fedora, so I would really like to see this
> be possible with Btrfs encryption.
> 
> Critically, it means that unlocking a user subvolume will always be
> multi-factor: something you have (machine key) and something you know
> (user credentials).

That's possible with the existing fscrypt semantics.  Just use a unique master
key for each container, and protect it with a key derived from both the machine
key *and* the user credential.  Protecting the fscrypt master key(s) is a
userspace problem, not a kernel one.  The kernel just receives the raw key.
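
A minimal userspace sketch of that idea, using a single HMAC step as a
stand-in for a real KDF (the salt string, key sources, and helper name are
all hypothetical; a production design would use a proper KDF such as HKDF
and handle the secrets far more carefully):

```shell
# Derive a per-container wrapping secret from BOTH the machine key and
# the user credential; losing either factor makes the output unrecoverable.
derive_wrap_key() {
    # $1 = machine key (e.g. unsealed from the TPM)
    # $2 = user credential (e.g. the user's passphrase)
    printf '%s%s' "$1" "$2" \
        | openssl dgst -sha256 -hmac "per-container-salt" \
        | awk '{print $NF}'   # keep only the hex digest
}

wrap_key=$(derive_wrap_key "$(cat /run/machine.key)" "$user_credential")
# ...then use $wrap_key to unwrap the container's raw fscrypt master key
# before handing it to the kernel.
```

The point is only that the kernel sees one raw master key per container;
how that key is wrapped and by which factors is entirely up to userspace.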

- Eric
Sweet Tea Dorminy July 5, 2023, 5 p.m. UTC | #9
>> I like creating containers directly based on my host system for
>> development and destructive purposes. It saves space and is incredibly
>> useful.
> 
> A solution for that already exists.  It's called btrfs snapshots.  Which you
> probably already know, since it's probably what you're using :-)
> 
> Using overlayfs would simply mean that each container consists of an upper and
> lower directory instead of a single directory.  Either or both could still be
> btrfs subvolumes.  They could even be on the same subvolume.

This isn't a full response; I'm still researching details of our setup and 
whether overlayfs could work for us. But regarding this particular usecase:

as I understand it, the lower layer for overlayfs is immutable. So if 
you set up a container/VM image in this way with overlayfs, you end up 
not being able to reclaim space from the original image, if e.g. you 
uninstall or upgrade a package present in the original.

This wastes disk space, and, if you want to migrate that subvol over to 
another machine/disk via btrfs send/receive, wastes network/CPU. 
Hopefully not a lot -- hopefully your container image is mostly 
immutable -- but for long-lived VMs like I usually use personally it 
could be basically the whole VM image wasted.
Sweet Tea Dorminy July 5, 2023, 7:41 p.m. UTC | #10
>>
>> Here's a shot at elaboration of usecase more:
>>
>> On various machines, we currently have a btrfs filesystem containing various
>> volumes/snapshots containing starting states for containers. The snapshots
>> are generated by common snapshot images built centrally. The machines, as
>> the scheduler requests, start and stop containers on those volumes.
>>
>> We want to be able to start a container on a snapshot/volume such that every
>> write to the relevant snapshot/volume is using a per-container key, but we
>> don't care about reads of the starting snapshot image being encrypted since
>> the starting snapshot image is widely shared. When the container is stopped,
>> no future different container (or human or host program) knows its key. This
>> limits the data which could be lost to a malicious service/human on the host
>> to only the volumes containing currently running containers.
>>
>> Some other folks envision having a base image encrypted with some per-vendor
>> key. Then the machine is rekeyed with a per-machine key in perhaps the TPM
>> to use for updates and logfiles. When a user is created, a snapshot of a
>> base homedir forms the base of their user subvolume/directory, which is then
>> rekeyed with a per-user key. When the user logs in, systemd-homedir or the
>> like could load their per-user key for their user subvolume/directory.
>>
>> Since we don't care about encrypting the common image, we initially
>> envisioned unencrypted snapshot images where we then turn on encryption and
>> have mixed unenc/enc data. The other usecase, though, really needs key
>> change so that everything's encrypted. And the argument that mixed unenc/enc
>> data is not safe was compelling.
>>
>> Hope that helps?
> 
> Maybe a dumb question: why aren't you just using overlayfs?  It's already
> possible to use overlayfs with an fscrypt-encrypted upperdir and workdir.  When
> creating a new container you can create a new directory and assign it an fscrypt
> policy (with a per-container or per-user key or whatever that container wants),
> and create two subdirectories 'upperdir' and 'workdir' in it.  Then just mount
> an overlayfs with that upperdir and workdir, and lowerdir referring to the
> starting rootfs.  Then use that overlayfs as the rootfs as the container.
> 
> Wouldn't that solve your use case exactly?  Is there a reason you really want to
> create the container directly from a btrfs snapshot instead?

After talking it over, nested containers/subvols don't work easily with 
this scheme. Right now, one can make arbitrarily nested subvols inside 
of subvols: e.g. a container which only sees /subvol can make subvol 
/subvol/nested without elevated permissions, and a container which only 
sees /subvol/nested could make yet another nested subvol 
/subvol/nested/foo/nested2, ad infinitum. There aren't, afaik, limits on 
the recursive depth of subvols or containers, or on how close they are 
in the directory tree.
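
Concretely, the nesting described above looks like this (paths are the
illustrative ones from the paragraph; no elevated privileges are needed):

```shell
# From a container that sees only /subvol:
btrfs subvolume create /subvol/nested
# From a container that sees only /subvol/nested:
mkdir -p /subvol/nested/foo
btrfs subvolume create /subvol/nested/foo/nested2
# ...and so on, to arbitrary depth.
```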

This isn't purely theoretical; I learned today there are a couple of 
workloads internally which run in a long-lived container on a subvol, 
and spin up a bunch of short-lived containers on short-lived subvols 
inside the long-lived container/subvol.

I don't think the overlayfs scheme works with this. From the container's 
point of view, overlayfs would be presenting a wholly encrypted 
filesystem (which is what we want). But from inside the container, even 
if we plumbed through the ability to make a new subvol, it would be hard 
to create a new overlayfs upper directory with a new key for a nested 
container, since directories must have the same key as their parent 
directory unless the parent is unencrypted. We'd need to allow the 
parent container to escape into an unencrypted directory to make a new 
encrypted upperdir for the nested container, which would defeat having 
the container only able to write to encrypted locations. I can't come up 
with a way to make the overlayfs scheme work with this, but maybe I 
don't know overlayfs well enough.

A decidedly intriguing idea! Thanks

Sweet Tea