
[03/18] f2fs crypto: declare some definitions for f2fs encryption feature

Message ID 1431145253-2019-3-git-send-email-jaegeuk@kernel.org (mailing list archive)
State New, archived

Commit Message

Jaegeuk Kim May 9, 2015, 4:20 a.m. UTC
These definitions will be used by the inode and superblock for encryption.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
---
 fs/f2fs/f2fs.h          |  54 ++++++++++++++++++
 fs/f2fs/f2fs_crypto.h   | 149 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/f2fs_fs.h |   4 +-
 3 files changed, 206 insertions(+), 1 deletion(-)
 create mode 100644 fs/f2fs/f2fs_crypto.h

Comments

Dave Chinner May 13, 2015, 2:02 a.m. UTC | #1
On Fri, May 08, 2015 at 09:20:38PM -0700, Jaegeuk Kim wrote:
> These definitions will be used by the inode and superblock for encryption.

How much of this crypto stuff is common with or only slightly
modified from the ext4 code?  Is the behaviour and features the
same? Is the user API and management tools the same?

IMO, if there is any amount of overlap, then we should be
implementing this stuff as generic code, not propagating the same
code through multiple filesystems via copy-n-paste-n-modify. This
will simply end up with diverging code, different bugs and feature
sets, and none of the implementations will get the review and
maintenance they really require...

And, FWIW, this is the reason why I originally asked for the ext4
encryption code to be pulled up to the VFS: precisely so we didn't
end up with a rapid proliferation of individual in-filesystem
encryption implementations that are all slightly different...

Cheers,

Dave.
Jaegeuk Kim May 13, 2015, 6:48 a.m. UTC | #2
On Wed, May 13, 2015 at 12:02:08PM +1000, Dave Chinner wrote:
> On Fri, May 08, 2015 at 09:20:38PM -0700, Jaegeuk Kim wrote:
> > These definitions will be used by the inode and superblock for encryption.
> 
> How much of this crypto stuff is common with or only slightly
> modified from the ext4 code?  Is the behaviour and features the
> same? Is the user API and management tools the same?
> 
> IMO, if there is any amount of overlap, then we should be
> implementing this stuff as generic code, not propagating the same
> code through multiple filesystems via copy-n-paste-n-modify. This
> will simply end up with diverging code, different bugs and feature
> sets, and none of the implementations will get the review and
> maintenance they really require...
> 
> And, FWIW, this is the reason why I originally asked for the ext4
> encryption code to be pulled up to the VFS: precisely so we didn't
> end up with a rapid proliferation of individual in-filesystem
> encryption implementations that are all slightly different...

Totally agreed!

AFAIK, Ted ultimately wants to push the code into fs/ as a crypto library,
so I believe most of the crypto code is common.

But, in order to realize that quickly, Ted implemented the feature in ext4
first, to finalize the on-disk and in-memory design. Then, I've been catching
up and validating that design by implementing it in f2fs, which also helps
figure out exactly which crypto code can be common.

As Ted mentioned before, since the next Android version intends to use
per-file encryption, f2fs also needs to support it as quickly as possible,
just like ext4.

Meanwhile, I've certainly been working on patches to push the code into fs/;
currently I've done crypto.c and will do crypto_key.c and crypto_fname.c next.
But crypto_policy.c needs to be thought about differently, since it may depend
on how each filesystem stores the policy information; we cannot insist that
all filesystems use xattrs, right?

Anyway, let me take some time to work on this and submit RFC patches.

Thanks,
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner May 14, 2015, 12:37 a.m. UTC | #3
On Tue, May 12, 2015 at 11:48:02PM -0700, Jaegeuk Kim wrote:
> On Wed, May 13, 2015 at 12:02:08PM +1000, Dave Chinner wrote:
> > [...]
> 
> Totally agreed!
> 
> AFAIK, Ted ultimately wants to push the code into fs/ as a crypto library,
> so I believe most of the crypto code is common.

Can I suggest fs/crypto/ if there are going to be multiple files?

> But, in order to realize that quickly, Ted implemented the feature in ext4
> first, to finalize the on-disk and in-memory design. Then, I've been catching
> up and validating that design by implementing it in f2fs, which also helps
> figure out exactly which crypto code can be common.

Excellent. That will make it easier and less error prone for other
filesystems to implement it, too!

> As Ted mentioned before, since the next Android version intends to use
> per-file encryption, f2fs also needs to support it as quickly as possible,
> just like ext4.

Fair enough.

> Meanwhile, I've certainly been working on patches to push the code into fs/;
> currently I've done crypto.c and will do crypto_key.c and crypto_fname.c next.
> But crypto_policy.c needs to be thought about differently, since it may depend
> on how each filesystem stores the policy information; we cannot insist that
> all filesystems use xattrs, right?

All filesystems that are likely to implement per-file crypto support xattrs,
and this is exactly what xattrs are designed for. e.g. we already
require xattrs for generic security labels, ACLs, etc. Hence
per-file crypto information should also use a common, shared xattr
format. That way it only needs to be implemented once in the generic
code and there's very little (hopefully nothing!) each filesystem
has to customise to store the crypto information for each file.

Cheers,

Dave.
Jaegeuk Kim May 14, 2015, 1:56 a.m. UTC | #4
On Thu, May 14, 2015 at 10:37:21AM +1000, Dave Chinner wrote:
> On Tue, May 12, 2015 at 11:48:02PM -0700, Jaegeuk Kim wrote:
> > [...]
> > 
> > AFAIK, Ted ultimately wants to push the code into fs/ as a crypto library,
> > so I believe most of the crypto code is common.
> 
> Can I suggest fs/crypto/ if there are going to be multiple files?

No problem at all. I'll do that.

> 
> > But, in order to realize that quickly, Ted implemented the feature in ext4
> > first, to finalize the on-disk and in-memory design. Then, I've been catching
> > up and validating that design by implementing it in f2fs, which also helps
> > figure out exactly which crypto code can be common.
> 
> Excellent. That will make it easier and less error prone for other
> filesystems to implement it, too!
> 
> > As Ted mentioned before, since the next Android version intends to use
> > per-file encryption, f2fs also needs to support it as quickly as possible,
> > just like ext4.
> 
> Fair enough.
> 
> > Meanwhile, I've certainly been working on patches to push the code into fs/;
> > currently I've done crypto.c and will do crypto_key.c and crypto_fname.c next.
> > But crypto_policy.c needs to be thought about differently, since it may depend
> > on how each filesystem stores the policy information; we cannot insist that
> > all filesystems use xattrs, right?
> 
> All filesystems that are likely to implement per-file crypto support xattrs,
> and this is exactly what xattrs are designed for. e.g. we already
> require xattrs for generic security labels, ACLs, etc. Hence
> per-file crypto information should also use a common, shared xattr
> format. That way it only needs to be implemented once in the generic
> code and there's very little (hopefully nothing!) each filesystem
> has to customise to store the crypto information for each file.

Ok, I see. Let me take a look at that too.
Thank you for sharing your thoughts. :)

Tom Marshall May 14, 2015, 4:50 p.m. UTC | #5
Please keep in mind that I'm also working on transparent compression.  
I'm watching this thread closely so that I can implement a compression 
library alongside the crypto library.  If there is any interest or 
benefit, I would be glad to work together so that the two can be done 
cooperatively at the same time.

Jaegeuk Kim May 16, 2015, 1:14 a.m. UTC | #6
On Thu, May 14, 2015 at 09:50:44AM -0700, Tom Marshall wrote:
> Please keep in mind that I'm also working on transparent
> compression.  I'm watching this thread closely so that I can
> implement a compression library alongside the crypto library.  If
> there is any interest or benefit, I would be glad to work together
> so that the two can be done cooperatively at the same time.

I can't immediately see how compression code could be shared with crypto.
The basic approach for compression is that X pages are compressed into a
smaller number of pages, Y, i.e. an X-to-Y mapping. But this per-file
encryption supports only a 1-to-1 4KB mapping, so it can be quite a simple
implementation.

Could you elaborate on your approach or design? Or share code?
In any case, IMO, it needs to be implemented in some filesystem first.

Thanks,

Tom Marshall May 16, 2015, 4:47 a.m. UTC | #7
On Fri, May 15, 2015 at 06:14:24PM -0700, Jaegeuk Kim wrote:
> [...]
> 
> I can't immediately see how compression code could be shared with crypto.
> The basic approach for compression is that X pages are compressed into a
> smaller number of pages, Y, i.e. an X-to-Y mapping. But this per-file
> encryption supports only a 1-to-1 4KB mapping, so it can be quite a simple
> implementation.

No, I don't intend to share actual code with crypto -- at least not much. 
I'm more interested in looking at how the crypto layer is implemented to
give me clues about how to implement a compression layer.

> Could you elaborate on your approach or design? Or share code?
> In any case, IMO, it needs to be implemented in some filesystem first.

I don't really have any working code yet.  I will probably get to that in
the coming few weeks.  Right now I'm still working with the ugly VFS
stacking implementation that I posted initially.

One thing I have done is dismiss the standard compression framing
formats.

zlib (and gzip) are designed for streaming, and it is quite difficult to
implement random access on them.  See the example code in the zlib source,
zran.c.  It's not really tenable because 32 KB of prior data must be kept
as priming information.  Even with fully encapsulated blocks using
Z_FINISH, there is still no way to skip over data without first
decompressing it to build an index.

lz4 is somewhat better in that blocks are self-contained.  But block lengths
must be read sequentially.  This means that reading an arbitrary position in
a file requires a proportional number of reads to find the desired block.

So, I am working with a simple framing format that I threw together.  The
header has a compression method (zlib or lz4), block size, original input
size, and a block map.
Theodore Ts'o May 16, 2015, 1:24 p.m. UTC | #8
On Thu, May 14, 2015 at 10:37:21AM +1000, Dave Chinner wrote:
> > 
> > AFAIK, Ted ultimately wants to push the code into fs/ as a crypto library,
> > so I believe most of the crypto code is common.
> 
> Can I suggest fs/crypto/ if there are going to be multiple files?

Yes, I think we definitely want to do this as fs/crypto; it makes
things a bit more straightforward on a number of fronts; not just from
a Makefile point of view, but also in setting up a separate tree to
push into linux-next.

> > But, in order to realize that quickly, Ted implemented the feature in ext4
> > first, to finalize the on-disk and in-memory design. Then, I've been catching
> > up and validating that design by implementing it in f2fs, which also helps
> > figure out exactly which crypto code can be common.
> 
> Excellent. That will make it easier and less error prone for other
> filesystems to implement it, too!

Yeah, that's why I figured it would be easier to get it into ext4 and
f2fs first.  We can keep the functions in sync for a release or so,
and then we can start creating patches that gradually move each of
fs/{ext4,f2fs}/{crypto,crypto_fname,crypto_policy}.c, etc. into a
common fs/crypto directory.  I'm sure there will be a bit of
refactoring that will be necessary as we go along, but it will be a
lot easier to validate that we've gotten things right once we have
two file systems both depending on the common code.

If Michael and I had tried to create fs/crypto up-front, I'm *sure* we
would have gotten a number of things wrong.  :-)

> > Meanwhile, I've certainly been working on patches to push the code into fs/;
> > currently I've done crypto.c and will do crypto_key.c and crypto_fname.c next.
> > But crypto_policy.c needs to be thought about differently, since it may depend
> > on how each filesystem stores the policy information; we cannot insist that
> > all filesystems use xattrs, right?
> 
> All filesystems that are likely to implement per-file crypto support xattrs,
> and this is exactly what xattrs are designed for. e.g. we already
> require xattrs for generic security labels, ACLs, etc. Hence
> per-file crypto information should also use a common, shared xattr
> format. That way it only needs to be implemented once in the generic
> code and there's very little (hopefully nothing!) each filesystem
> has to customise to store the crypto information for each file.

Yes, absolutely.  What's going to be more difficult in the long run is
when we want to support per-block data integrity, using (for example)
AES/GCM, which will require 32 bytes of per-block "authentication tag"
(think Message Authentication Code) that have to be committed
atomically with the data block write.

Quite frankly, this is going to be much easier for COW file systems
like f2fs and btrfs.  I have some ideas about how to do this for
update-in-place file systems (using either a compression hack and/or
storing two generations worth of authentication tag per block), but
it's not pretty.  (Or in the case of ext4 we could use
data=journalling mode, but that's even less pretty from a performance
standpoint.)

Cheers,

					- Ted
Tom Marshall May 16, 2015, 5:13 p.m. UTC | #9
On Sat, May 16, 2015 at 09:24:03AM -0400, Theodore Ts'o wrote:
> [...] What's going to be more difficult in the long run is
> when we want to support per-block data integrity, using (for example)
> AES/GCM, which will require 32 bytes of per-block "authentication tag"
> (think Message Authentication Code) that have to be committed
> atomically with the data block write.
> 
> Quite frankly, this is going to be much easier for COW file systems
> like f2fs and btrfs.  I have some ideas about how to do this for
> update-in-place file systems (using either a compression hack and/or
> storing two generations worth of authentication tag per block), but
> it's not pretty.  (Or in the case of ext4 we could use
> data=journalling mode, but that's even less pretty from a performance
> standpoint.)

It seems to me that compression has a similar issue with framing metadata. 
The block map can be stored upfront, which then would require two writes to
update a data block.  Or it can be stored inline, which then would require
scanning through the file sequentially for random access.

The big difference, of course, is that in the case of compression, we have
the luxury of assuming or mandating read-only/read-mostly.
Jaegeuk Kim May 18, 2015, 6:24 a.m. UTC | #10
On Fri, May 15, 2015 at 09:47:05PM -0700, Tom Marshall wrote:
> On Fri, May 15, 2015 at 06:14:24PM -0700, Jaegeuk Kim wrote:
> > [...]
> 
> No, I don't intend to share actual code with crypto -- at least not much. 
> I'm more interested in looking at how the crypto layer is implemented to
> give me clues about how to implement a compression layer.

Ok, I see.

Currently, I've been writing up fs/crypto with the crypto code shared between
ext4 and f2fs; I refactored the existing code a little bit though.

My approach is to introduce an fscrypt_operations structure for each
filesystem, like this:

struct fscrypt_operations {
	int (*get_context)(struct inode *, void *, size_t, void *);
	int (*set_context)(struct inode *, const void *, size_t, int, void *);
	int (*prepare_new_context)(struct inode *);
	bool (*is_encrypted)(struct inode *);
	bool (*empty_dir)(struct inode *);
	unsigned (*max_namelen)(struct inode *);
};

And, the following two basic functions will be used by filesystems:
fscrypt_encrypt_page() and fscrypt_decrypt_page().

> [...]

Thank you for sharing the approach; it makes sense for improving random
read performance.

Thanks,
Tom Marshall May 20, 2015, 5:46 p.m. UTC | #11
So I've been playing around a bit and I have a basic strategy laid out.
Please let me know if I'm on the right track.

Compressed file attributes
==========================

The filesystem is responsible for detecting whether a file is compressed and
hooking into the compression lib.  This may be done with an inode flag,
xattr, or any other applicable method.  No other special attributes are
necessary.

Compressed file format
======================

Compressed files shall have header, block map, and data sections.

Header:

byte[4]		magic		'zzzz' (not strictly needed)
byte		param1		method and flags
	bits 0..3 = compression method (1=zlib, 2=lz4, etc.)
	bits 4..7 = flags (none defined yet)
byte		blocksize	log2 of blocksize (max 31)
le48		orig_size	original uncompressed file size


Block map:

Vector of le16 (if the blocksize log2 is <= 16) or le32 (if it is > 16).
Each entry is the compressed size of the block.  Zero indicates that the
block is stored uncompressed, in case compression expanded the block.

Data:

Compressed data.

Compression library
===================

I'm just fleshing this out and trying to make it work.  The basic strategy
is to wrap the underlying readpage/readpages.  Right now I have the
following:

/*
 * For convenience.
 */
typedef int (*readpage_t)(struct page *);

/*
 * Indicate whether compression is enabled.  It may be desirable to disable
 * compression for test, backup, or maintenance activities.  Controlled by
 * a sysfs file.
 */
extern int zfile_enabled(void);

/*
 * Initialize internal data structures for a given inode.  This will
 * result in reading the file header and block map, so the inode must
 * be fully populated and ready to accept readpage requests.
 */
extern int zinode_init(struct inode *inode, readpage_t lower_readpage);

/*
 * Wrapper for filesystem's readpage.  Consults the block map and reads
 * the appropriate compressed pages for the requested block, decompresses
 * them, and populates the page with uncompressed data.
 */
extern int zinode_readpage(readpage_t lower_readpage,
		struct page *page);

/*
 * As above, but for multiple pages.
 */
extern int zinode_readpages(readpage_t lower_readpage,
		struct address_space *mapping,
		struct list_head *pages, unsigned nr_pages);

Questions and issues
====================

Should there be any padding for the data blocks?  For example, if writing is
to be supported, padding the compressed data to the filesystem block size
would allow for easy rewriting of individual blocks without disturbing the
surrounding blocks.  Perhaps padding could be indicated by a flag.

Note readpage_t does not take a file ptr, as it's not available at inode
creation time.  ext4 only uses it to look up the inode ptr.  Is this true
across all filesystems, or do some actually require the file ptr in their
readpage?

The compression code must be able to read pages from the underlying
filesystem.  This involves using the pagecache.  But the uncompressed data
is what ultimately should end up in the pagecache.  This is where I'm
currently stuck.  How do I implement the code such that the underlying
compressed data may be read (using the pagecache or not) while not
disturbing the pagecache for the uncompressed data?  I'm wondering if I need
to create an internal address_space to pass down into the underlying
readpage?  Or is there another way to do this?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Marshall May 20, 2015, 7:50 p.m. UTC | #12
> The compression code must be able to read pages from the underlying
> filesystem.  This involves using the pagecache.  But the uncompressed data
> is what ultimately should end up in the pagecache.  This is where I'm
> currently stuck.  How do I implement the code such that the underlying
> compressed data may be read (using the pagecache or not) while not
> disturbing the pagecache for the uncompressed data?  I'm wondering if I need
> to create an internal address_space to pass down into the underlying
> readpage?  Or is there another way to do this?

I created a private address_space for the inode that is passed into the
lower readpage.  That seems to be working.  Is this a cool thing to do
(hopefully)?
Theodore Ts'o May 20, 2015, 9:36 p.m. UTC | #13
On Wed, May 20, 2015 at 10:46:35AM -0700, Tom Marshall wrote:
> So I've been playing around a bit and I have a basic strategy laid out.
> Please let me know if I'm on the right track.
> 
> Compressed file attributes
> ==========================
> 
> The filesystem is responsible for detecting whether a file is compressed and
> hooking into the compression lib.  This may be done with an inode flag,
> xattr, or any other applicable method.  No other special attributes are
> necessary.

So I assume what you are implementing is read-only compression; that
is, once the file is written, and the attribute set indicating that
this is a compressed file, it is now immutable.

> Compressed file format
> ======================
> 
> Compressed files shall have header, block map, and data sections.
> 
> Header:
> 
> byte[4]		magic		'zzzz' (not strictly needed)
> byte		param1		method and flags
> 	bits 0..3 = compression method (1=zlib, 2=lz4, etc.)
> 	bits 4..7 = flags (none defined yet)
> byte		blocksize	log2 of blocksize (max 31)

I suggest using the term "compression cluster" to distinguish this
from the file system block size.

> le48		orig_size	original uncompressed file size
> 
> 
> Block map:
> 
> Vector of le16 (if blocksize <= 16) or le32 (if blocksize > 16).  Each entry
> is the compressed size of the block.  Zero indicates that the block is
> stored uncompressed, in case compression expanded the block.

What I would store instead is a list of 32 or 64-bit offsets, where the
nth entry in the array indicates the starting offset of the nth
compression cluster.
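
An offset table of this shape can be derived mechanically from a size
vector; a userspace sketch (whether offsets are relative to the data
section or the file start, and how stored-uncompressed clusters would be
flagged, are choices left open here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Build an offset table from the per-cluster compressed sizes: entry n
 * is the byte offset (within the data section) where cluster n starts.
 * With nclusters + 1 entries, the compressed size of cluster n falls
 * out as off[n + 1] - off[n], so no separate size field is needed and
 * a random seek avoids summing all preceding entries.
 */
static void build_offsets(const uint16_t *sizes, size_t nclusters,
			  uint32_t cluster_size, uint32_t *off)
{
	size_t i;

	off[0] = 0;
	for (i = 0; i < nclusters; i++)
		off[i + 1] = off[i] + (sizes[i] ? sizes[i] : cluster_size);
}
```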

> Questions and issues
> ====================
> 
> Should there be any padding for the data blocks?  For example, if writing is
> to be supported, padding the compressed data to the filesystem block size
> would allow for easy rewriting of individual blocks without disturbing the
> surrounding blocks.  Perhaps padding could be indicated by a flag.

If you add padding then you defeat the whole point of adding
compression.  What if the initial contents of a 64k cluster was all
zeros, so it trivially compresses down to a few dozen bytes; but then
it gets replaced by completely uncompressible data?  If you add 64k
worth of padding to each block, then you're not saving any space, so
what's the point?

> The compression code must be able to read pages from the underlying
> filesystem.  This involves using the pagecache.  But the uncompressed data
> is what ultimately should end up in the pagecache.  This is where I'm
> currently stuck.  How do I implement the code such that the underlying
> compressed data may be read (using the pagecache or not) while not
> disturbing the pagecache for the uncompressed data?  I'm wondering if I need
> to create an internal address_space to pass down into the underlying
> readpage?  Or is there another way to do this?

So I would *not* reference the compressed data via the page cache.  If
you do that, then you end up wasting space in the page cache, since
the page cache will contain both the compressed and decompressed data
--- and once the data has been decompressed, the compressed version is
completely useless.  So it's better to have the file system supply the
physical location on disk, and then to read in the compressed data to
a scratch set of pages which are freed immediately after you are done
decompressing things.

This is why compression is so very different from encryption.  The
constraints make it quite different.

Regards,

						- Ted
						

Tom Marshall May 20, 2015, 10:46 p.m. UTC | #14
So I have this all working as described.  I haven't implemented readahead
yet (readpages) so it's slow.  I'll be doing that next.

The other thing to note is that since the uncompressed size is stored inside
the file data, stat() requires reading both the inode and the first page of
the file.  That's not optimal, but I don't know if other generic out-of-band
solutions (such as xattrs) would be any better.  I suppose it depends on
whether the xattr info is read in with the inode or not.

Also, on the subject of file size, I'm currently swapping out the i_size for
the compressed i_size before calling down into the filesystem's readpage. 
Yes that's a nasty hack that I'll need to address.

On Wed, May 20, 2015 at 05:36:41PM -0400, Theodore Ts'o wrote:
> On Wed, May 20, 2015 at 10:46:35AM -0700, Tom Marshall wrote:
> > So I've been playing around a bit and I have a basic strategy laid out.
> > Please let me know if I'm on the right track.
> > 
> > Compressed file attributes
> > ==========================
> > 
> > The filesystem is responsible for detecting whether a file is compressed and
> > hooking into the compression lib.  This may be done with an inode flag,
> > xattr, or any other applicable method.  No other special attributes are
> > necessary.
> 
> So I assume what you are implementing is read-only compression; that
> is, once the file is written, and the attribute set indicating that
> this is a compressed file, it is now immutable.

That is TBD.  Our use case right now only requires read-only, but I think
read-write would be a nice thing if it's not too convoluted.

fallocate() is supported on the major filesystems, and I imagine the same
mechanisms could be used to provide rewriting of the "compression clusters".

> > Compressed file format
> > ======================
> > 
> > Compressed files shall have header, block map, and data sections.
> > 
> > Header:
> > 
> > byte[4]		magic		'zzzz' (not strictly needed)
> > byte		param1		method and flags
> > 	bits 0..3 = compression method (1=zlib, 2=lz4, etc.)
> > 	bits 4..7 = flags (none defined yet)
> > byte		blocksize	log2 of blocksize (max 31)
> 
> I suggest using the term "compression cluster" to distinguish this
> from the file system block size.

Sure, just a name...

> > le48		orig_size	original uncompressed file size
> > 
> > 
> > Block map:
> > 
> > Vector of le16 (if blocksize <= 16) or le32 (if blocksize > 16).  Each entry
> > is the compressed size of the block.  Zero indicates that the block is
> > stored uncompressed, in case compression expanded the block.
> 
> What I would store instead is list of 32 or 64-bit offsets, where the
> nth entry in the array indicates the starting offset of the nth
> compression cluster.

Why?  This would both increase the space requirements and require some other
mechanism to indicate uncompressed compression clusters (eg. setting the
high bit or something).

> > Questions and issues
> > ====================
> > 
> > Should there be any padding for the data blocks?  For example, if writing is
> > to be supported, padding the compressed data to the filesystem block size
> > would allow for easy rewriting of individual blocks without disturbing the
> > surrounding blocks.  Perhaps padding could be indicated by a flag.
> 
> If you add padding then you defeat the whole point of adding
> compression.  What if the initial contents of a 64k cluster was all
> zeros, so it trivially compresses down to a few dozen bytes; but then
> it gets replaced by completely uncompressible data?  If you add 64k
> worth of padding to each block, then you're not saving any space, so
> what's the point?

Sorry, I meant padding to the filesystem block size.  So, for example, if a
64kb compression cluster is compressed to 31kb, it would use 8*4kb blocks
and the next compression cluster would start on a new block.
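
The space cost of that padding is plain rounding arithmetic; a sketch
using the usual DIV_ROUND_UP idiom (reproduced here so it runs in
userspace):

```c
#include <assert.h>
#include <stdint.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * With padding to the filesystem block size, a cluster that compressed
 * to csize bytes occupies a whole number of fs blocks, and the next
 * cluster starts on a fresh block.
 */
static uint32_t padded_cluster_bytes(uint32_t csize, uint32_t fs_block)
{
	return DIV_ROUND_UP(csize, fs_block) * fs_block;
}
```

So the 31kb example occupies 32kb on disk; the waste is bounded by one
fs block minus one byte per cluster.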

> > The compression code must be able to read pages from the underlying
> > filesystem.  This involves using the pagecache.  But the uncompressed data
> > is what ultimately should end up in the pagecache.  This is where I'm
> > currently stuck.  How do I implement the code such that the underlying
> > compressed data may be read (using the pagecache or not) while not
> > disturbing the pagecache for the uncompressed data?  I'm wondering if I need
> > to create an internal address_space to pass down into the underlying
> > readpage?  Or is there another way to do this?
> 
> So I would *not* reference the compressed data via the page cache.  If
> you do that, then you end up wasting space in the page cache, since
> the page cache will contain both the compressed and decompressed data
> --- and once the data has been decompressed, the compressed version is
> completely useless.  So it's better to have the file system supply the
> physical location on disk, and then to read in the compressed data to
> a scratch set of pages which are freed immediately after you are done
> decompressing things.

I'm currently using an internal address_space to pass down into the
underlying filesystem to make things easy.  I'm too inexperienced in
filesystem development to unravel how to plumb in anything else (but I'm
learning quickly!)

If you have ideas about how to do the underlying readpage without the
pagecache, please enlighten me.

Or perhaps I could manually release the page from the private cache after
the uncompressed data has been extracted?

> 
> This is why compression is so very different from encryption.  The
> constraints make it quite different.
> 
> Regards,
> 
> 						- Ted
> 						
> 
Tom Marshall May 21, 2015, 4:28 a.m. UTC | #15
On Wed, May 20, 2015 at 03:46:30PM -0700, Tom Marshall wrote:
> So I have this all working as described.  I haven't implemented readahead
> yet (readpages) so it's slow.  I'll be doing that next.
> 
> The other thing to note is that since the uncompressed size is stored inside
> the file data, stat() requires reading both the inode and the first page of
> the file.  That's not optimal, but I don't know if other generic out-of-band
> solutions (such as xattrs) would be any better.  I suppose it depends on
> whether the xattr info is read in with the inode or not.
> 
> Also, on the subject of file size, I'm currently swapping out the i_size for
> the compressed i_size before calling down into the filesystem's readpage. 
> Yes that's a nasty hack that I'll need to address.
> 
> On Wed, May 20, 2015 at 05:36:41PM -0400, Theodore Ts'o wrote:
> > On Wed, May 20, 2015 at 10:46:35AM -0700, Tom Marshall wrote:
> > > So I've been playing around a bit and I have a basic strategy laid out.
> > > Please let me know if I'm on the right track.
> > > 
> > > Compressed file attributes
> > > ==========================
> > > 
> > > The filesystem is responsible for detecting whether a file is compressed and
> > > hooking into the compression lib.  This may be done with an inode flag,
> > > xattr, or any other applicable method.  No other special attributes are
> > > necessary.
> > 
> > So I assume what you are implementing is read-only compression; that
> > is, once the file is written, and the attribute set indicating that
> > this is a compressed file, it is now immutable.
> 
> That is TBD.  Our use case right now only requires read-only, but I think
> read-write would be a nice thing if it's not too convoluted.
> 
> fallocate() is supported on the major filesystems, and I imagine the same
> mechanisms could be used to provide rewriting of the "compression clusters".

It just occurred to me that this could be the answer to both a consistent
i_size and providing read-write access.

If compression clusters were implemented by holes in the underlying file and
the header+blockmap were not stored in the file proper, then the compressed
and uncompressed file sizes would match.

Further, rewriting a compression cluster would be a simple matter of
adjusting the allocated compression cluster blocks as fallocate() provides.

I honestly don't have a clue how to store the header+blockmap outside the
file yet, but if we could, that would seem to tie everything together quite
nicely.

Thoughts?
Tom Marshall May 27, 2015, 6:53 p.m. UTC | #16
As I mentioned earlier, I had a hack working for transparent file 
compression.  But it has a few problems.  The two major ones are (1) It 
reads the lower pages synchronously by waiting on pages with 
wait_on_page_locked(), destroying readahead, and (2) it is swapping out 
i_size to call the lower readpage.  So I'm ditching that and going 
another route.

My idea now is to create a wrapper inode that is passed back to the VFS 
layer.  The wrapper inode would have the correct i_size, blkbits, and 
implement address_space_operations.  The implementation of 
readpage/readpages would then call down into the lower filesystem's 
readpage/readpages to fetch compressed data.  When the required pages 
are available, it would decompress and populate its own requested pages.

But one thing I'm wrestling with is how to be asynchronously notified 
when the lower readpage/readpages complete.  The three ideas that come to 
mind are (1) plumbing a callback into mpage_end_io(), (2) allowing 
override of mpage_end_io() with a custom function, (3) creating kernel 
threads analogous to kblockd to wait for pending pages.

But again, I'm new at this, so maybe there's an easier or better way?

Theodore Ts'o May 27, 2015, 11:38 p.m. UTC | #17
On Wed, May 27, 2015 at 11:53:17AM -0700, Tom Marshall wrote:
> But one thing I'm wrestling with is how to be asynchronously notified when
> the lower readpage/readpages complete.  The two ideas that come to mind are
> (1) plumbing a callback into mpage_end_io(), (2) allowing override of
> mpage_end_io() with a custom function, (3) creating kernel threads analogous
> to kblockd to wait for pending pages.

Not all file systems use mpage_end_io(), so that's not a good
solution.

You can do something like

	wait_on_page_bit(page, PG_uptodate);

... although to be robust you will also need to wake up if PG_error is
set (if there is an I/O error, PG_error is set instead of
PG_uptodate).  So that means you'd have to spin your own wait function
using the waitqueue primitives and page_waitqueue(), using
__wait_on_bit() as an initial model.

This suggestion should not be taken as an endorsement of your
higher-level architecture.  I suggest you think very carefully about
whether or not you need to be able to support random write
functionality, and if you don't, there are simpler ways such as the
one I outlined to you earlier.

Regards, and good luck,

						- Ted
Tom Marshall May 28, 2015, 12:20 a.m. UTC | #18
On Wed, May 27, 2015 at 07:38:00PM -0400, Theodore Ts'o wrote:
> On Wed, May 27, 2015 at 11:53:17AM -0700, Tom Marshall wrote:
> > But one thing I'm wrestling with is how to be asynchronously notified when
> > the lower readpage/readpages complete.  The two ideas that come to mind are
> > (1) plumbing a callback into mpage_end_io(), (2) allowing override of
> > mpage_end_io() with a custom function, (3) creating kernel threads analogous
> > to kblockd to wait for pending pages.
> 
> Not all file systems use mpage_end_io(), so that's not a good
> solution.

Ah, thanks, I was not aware of that.  So that leaves waiting on pages, which
probably means a fair amount of plumbing to do correctly.

> You can do something like
> 
> 	wait_on_page_bit(page, PG_uptodate);
> 
> ... although to be robust you will also need to wake up if PG_error is
> set (if there is an I/O error, PG_error is set instead of
> PG_uptodate).  So that means you'd have to spin your own wait function
> using the waitqueue primitives and page_waitqueue(), using
> __wait_on_bit() as an initial model.

Right, that should be pretty easy.

> This suggestion should not be taken as an endorsement of your
> higher-levle architecture.  I suggest you think very carefully about
> whether or not you need to be able to support random write
> functionality, and if you don't, there are simpler ways such as the
> one I outlined to you earlier.

I recall this:

> [...] So it's better to have the file system supply the physical location on
> disk, and then to read in the compressed data to a scratched set of page
> which is freed immediately after you are done decompressing things.

Is that what you're referring to?

If so, I'm not seeing how this makes things simpler.  It's still
asynchronous, right?  ext4_readpage calls back into mpage_readpage which
uses ext4_get_block, which then queues a bio request.  I don't see any way
to avoid queueing asynchronous bio requests or even getting completion
notifications.

Now, I could pass ext4_get_block up into my code and setup my own bio
requests so that I can get callbacks.  But this basically means implementing
the equivalent of do_mpage_readpage in my own code, and that's not really
trivial code to copy/paste/hack.  And it also doesn't address filesystems
that don't use mpage_end_io(), right?

Am I missing something?
Tom Marshall May 28, 2015, 8:55 p.m. UTC | #19
On Wed, May 27, 2015 at 07:38:00PM -0400, Theodore Ts'o wrote:
> On Wed, May 27, 2015 at 11:53:17AM -0700, Tom Marshall wrote:
> > But one thing I'm wrestling with is how to be asynchronously notified when
> > the lower readpage/readpages complete.  The two ideas that come to mind are
> > (1) plumbing a callback into mpage_end_io(), (2) allowing override of
> > mpage_end_io() with a custom function, (3) creating kernel threads analogous
> > to kblockd to wait for pending pages.
> 
> Not all file systems use mpage_end_io(), so that's not a good
> solution.
> 
> You can do something like
> 
> 	wait_on_page_bit(page, PG_uptodate);
> 
> ... although to be robust you will also need to wake up if PG_error is
> set (if there is an I/O error, PG_error is set instead of
> PG_uptodate).  So that means you'd have to spin your own wait function
> using the waitqueue primitives and page_waitqueue(), using
> __wait_on_bit() as an initial model.

My current thought is to use a workqueue and queue a work object for each
cluster read that is in flight.  The work function does wait_on_page_locked
for each lower page.  If/when all complete without error, the data is then
decompressed and the upper pages are entered.  This is currently a work in
progress, it's not yet functional.

So basically my questions here are:

(1) Does this seem sane?

(2) Is it ok to wait on !locked instead of waiting on (uptodate|error)?

(3) Do I need to mark a cluster as pending to prevent simultaneous
    reads of the same cluster (especially since the cluster size will be
    much larger than the lower block size)?

Also note I've now realized that the logical conclusion to the wrapper inode
idea is a stacked filesystem.  That's not the direction we are aiming for,
so instead I'm adding a "struct xcomp_inode_info" to ext4_inode_info, which
can then be passed back into the xcomp functions.
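
On question (3), the "cluster already in flight" state can be tracked
with a per-cluster pending bit.  A minimal single-threaded model of the
bookkeeping (in the kernel this would have to be a real atomic
test-and-set, e.g. test_and_set_bit(), or sit under a lock):

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Try to claim cluster n for reading.  Returns true if the caller now
 * owns the read; false if another read of the same cluster is already
 * in flight and the caller should wait for it to complete instead.
 */
static bool cluster_claim(unsigned long *pending, unsigned int n)
{
	unsigned long mask = 1UL << (n % BITS_PER_LONG);
	unsigned long *word = &pending[n / BITS_PER_LONG];

	if (*word & mask)
		return false;
	*word |= mask;
	return true;
}

/* Drop the pending bit once the cluster's pages are up to date. */
static void cluster_release(unsigned long *pending, unsigned int n)
{
	pending[n / BITS_PER_LONG] &= ~(1UL << (n % BITS_PER_LONG));
}
```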
Tom Marshall May 29, 2015, 12:18 a.m. UTC | #20
So I've just gotten this all working.  The last notable change I made was to
inode size: I added an i_compressed_size member and then did some macro
hackage to ensure that the fs implementation (eg. fs/ext4) sees the
compressed size while everything else sees the uncompressed size.

I'll be testing further tomorrow.

On Thu, May 28, 2015 at 01:55:15PM -0700, Tom Marshall wrote:
> On Wed, May 27, 2015 at 07:38:00PM -0400, Theodore Ts'o wrote:
> > On Wed, May 27, 2015 at 11:53:17AM -0700, Tom Marshall wrote:
> > > But one thing I'm wrestling with is how to be asynchronously notified when
> > > the lower readpage/readpages complete.  The two ideas that come to mind are
> > > (1) plumbing a callback into mpage_end_io(), (2) allowing override of
> > > mpage_end_io() with a custom function, (3) creating kernel threads analogous
> > > to kblockd to wait for pending pages.
> > 
> > Not all file systems use mpage_end_io(), so that's not a good
> > solution.
> > 
> > You can do something like
> > 
> > 	wait_on_page_bit(page, PG_uptodate);
> > 
> > ... although to be robust you will also need to wake up if PG_error is
> > set (if there is an I/O error, PG_error is set instead of
> > PG_uptodate).  So that means you'd have to spin your own wait function
> > using the waitqueue primitives and page_waitqueue(), using
> > __wait_on_bit() as an initial model.
> 
> My current thought is to use a workqueue and queue a work object for each
> cluster read that is in flight.  The work function does wait_on_page_locked
> for each lower page.  If/when all complete without error, the data is then
> decompressed and the upper pages are entered.  This is currently a work in
> progress, it's not yet functional.
> 
> So basically my questions here are:
> 
> (1) Does this seem sane?
> 
> (2) Is it ok to wait on !locked instead of waiting on (uptodate|error)?
> 
> (3) Do I need to mark a pending cluster pending to prevent simultaneous
>     reads of the same cluster (especially since the cluster size will be
>     much larger than the lower block size)?
> 
> Also note I've now realized that the logical conclusion to the wrapper inode
> idea is a stacked filesystem.  That's not the direction we are aiming for,
> so instead I'm adding a "struct xcomp_inode_info" to ext4_inode_info, which
> can then be passed back into the xcomp functions.

Patch

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 477e65f..c3c4deb 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -70,6 +70,8 @@  struct f2fs_mount_info {
 	unsigned int	opt;
 };
 
+#define F2FS_FEATURE_ENCRYPT	0x0001
+
 #define F2FS_HAS_FEATURE(sb, mask)					\
 	((F2FS_SB(sb)->raw_super->feature & cpu_to_le32(mask)) != 0)
 #define F2FS_SET_FEATURE(sb, mask)					\
@@ -346,6 +348,7 @@  struct f2fs_map_blocks {
  */
 #define FADVISE_COLD_BIT	0x01
 #define FADVISE_LOST_PINO_BIT	0x02
+#define FADVISE_ENCRYPT_BIT	0x04
 
 #define file_is_cold(inode)	is_file(inode, FADVISE_COLD_BIT)
 #define file_wrong_pino(inode)	is_file(inode, FADVISE_LOST_PINO_BIT)
@@ -353,6 +356,16 @@  struct f2fs_map_blocks {
 #define file_lost_pino(inode)	set_file(inode, FADVISE_LOST_PINO_BIT)
 #define file_clear_cold(inode)	clear_file(inode, FADVISE_COLD_BIT)
 #define file_got_pino(inode)	clear_file(inode, FADVISE_LOST_PINO_BIT)
+#define file_is_encrypt(inode)	is_file(inode, FADVISE_ENCRYPT_BIT)
+#define file_set_encrypt(inode)	set_file(inode, FADVISE_ENCRYPT_BIT)
+#define file_clear_encrypt(inode) clear_file(inode, FADVISE_ENCRYPT_BIT)
+
+/* Encryption algorithms */
+#define F2FS_ENCRYPTION_MODE_INVALID		0
+#define F2FS_ENCRYPTION_MODE_AES_256_XTS	1
+#define F2FS_ENCRYPTION_MODE_AES_256_GCM	2
+#define F2FS_ENCRYPTION_MODE_AES_256_CBC	3
+#define F2FS_ENCRYPTION_MODE_AES_256_CTS	4
 
 #define DEF_DIR_LEVEL		0
 
@@ -380,6 +393,11 @@  struct f2fs_inode_info {
 	struct radix_tree_root inmem_root;	/* radix tree for inmem pages */
 	struct list_head inmem_pages;	/* inmemory pages managed by f2fs */
 	struct mutex inmem_lock;	/* lock for inmemory pages */
+
+#ifdef CONFIG_F2FS_FS_ENCRYPTION
+	/* Encryption params */
+	struct f2fs_crypt_info *i_crypt_info;
+#endif
 };
 
 static inline void get_extent_info(struct extent_info *ext,
@@ -1891,4 +1909,40 @@  void f2fs_delete_inline_entry(struct f2fs_dir_entry *, struct page *,
 						struct inode *, struct inode *);
 bool f2fs_empty_inline_dir(struct inode *);
 int f2fs_read_inline_dir(struct file *, struct dir_context *);
+
+/*
+ * crypto support
+ */
+static inline int f2fs_encrypted_inode(struct inode *inode)
+{
+#ifdef CONFIG_F2FS_FS_ENCRYPTION
+	return file_is_encrypt(inode);
+#else
+	return 0;
+#endif
+}
+
+static inline void f2fs_set_encrypted_inode(struct inode *inode)
+{
+#ifdef CONFIG_F2FS_FS_ENCRYPTION
+	file_set_encrypt(inode);
+#endif
+}
+
+static inline bool f2fs_bio_encrypted(struct bio *bio)
+{
+#ifdef CONFIG_F2FS_FS_ENCRYPTION
+	return unlikely(bio->bi_private != NULL);
+#else
+	return false;
+#endif
+}
+
+static inline int f2fs_sb_has_crypto(struct super_block *sb)
+{
+#ifdef CONFIG_F2FS_FS_ENCRYPTION
+	return F2FS_HAS_FEATURE(sb, F2FS_FEATURE_ENCRYPT);
+#else
+	return 0;
+#endif
+}
 #endif
diff --git a/fs/f2fs/f2fs_crypto.h b/fs/f2fs/f2fs_crypto.h
new file mode 100644
index 0000000..cfc37c1
--- /dev/null
+++ b/fs/f2fs/f2fs_crypto.h
@@ -0,0 +1,149 @@ 
+/*
+ * linux/fs/f2fs/f2fs_crypto.h
+ *
+ * Copied from linux/fs/ext4/ext4_crypto.h
+ *
+ * Copyright (C) 2015, Google, Inc.
+ *
+ * This contains encryption header content for f2fs
+ *
+ * Written by Michael Halcrow, 2015.
+ * Modified by Jaegeuk Kim, 2015.
+ */
+#ifndef _F2FS_CRYPTO_H
+#define _F2FS_CRYPTO_H
+
+#include <linux/fs.h>
+
+#define F2FS_KEY_DESCRIPTOR_SIZE	8
+
+/* Policy provided via an ioctl on the topmost directory */
+struct f2fs_encryption_policy {
+	char version;
+	char contents_encryption_mode;
+	char filenames_encryption_mode;
+	char flags;
+	char master_key_descriptor[F2FS_KEY_DESCRIPTOR_SIZE];
+} __attribute__((__packed__));
+
+#define F2FS_ENCRYPTION_CONTEXT_FORMAT_V1	1
+#define F2FS_KEY_DERIVATION_NONCE_SIZE		16
+
+#define F2FS_POLICY_FLAGS_PAD_4		0x00
+#define F2FS_POLICY_FLAGS_PAD_8		0x01
+#define F2FS_POLICY_FLAGS_PAD_16	0x02
+#define F2FS_POLICY_FLAGS_PAD_32	0x03
+#define F2FS_POLICY_FLAGS_PAD_MASK	0x03
+#define F2FS_POLICY_FLAGS_VALID		0x03
+
+/**
+ * Encryption context for inode
+ *
+ * Protector format:
+ *  1 byte: Protector format (1 = this version)
+ *  1 byte: File contents encryption mode
+ *  1 byte: File names encryption mode
+ *  1 byte: Flags
+ *  8 bytes: Master Key descriptor
+ *  16 bytes: Encryption Key derivation nonce
+ */
+struct f2fs_encryption_context {
+	char format;
+	char contents_encryption_mode;
+	char filenames_encryption_mode;
+	char flags;
+	char master_key_descriptor[F2FS_KEY_DESCRIPTOR_SIZE];
+	char nonce[F2FS_KEY_DERIVATION_NONCE_SIZE];
+} __attribute__((__packed__));
+
+/* Encryption parameters */
+#define F2FS_XTS_TWEAK_SIZE 16
+#define F2FS_AES_128_ECB_KEY_SIZE 16
+#define F2FS_AES_256_GCM_KEY_SIZE 32
+#define F2FS_AES_256_CBC_KEY_SIZE 32
+#define F2FS_AES_256_CTS_KEY_SIZE 32
+#define F2FS_AES_256_XTS_KEY_SIZE 64
+#define F2FS_MAX_KEY_SIZE 64
+
+struct f2fs_encryption_key {
+	__u32 mode;
+	char raw[F2FS_MAX_KEY_SIZE];
+	__u32 size;
+} __attribute__((__packed__));
+
+struct f2fs_crypt_info {
+	unsigned char	ci_mode;
+	unsigned char	ci_size;
+	char		ci_data_mode;
+	char		ci_filename_mode;
+	char		ci_flags;
+	struct crypto_ablkcipher *ci_ctfm;
+	struct key	*ci_keyring_key;
+	char		ci_raw[F2FS_MAX_KEY_SIZE];
+	char		ci_master_key[F2FS_KEY_DESCRIPTOR_SIZE];
+};
+
+#define F2FS_CTX_REQUIRES_FREE_ENCRYPT_FL             0x00000001
+#define F2FS_BOUNCE_PAGE_REQUIRES_FREE_ENCRYPT_FL     0x00000002
+
+struct f2fs_crypto_ctx {
+	struct crypto_tfm *tfm;         /* Crypto API context */
+	struct page *bounce_page;       /* Ciphertext page on write path */
+	struct page *control_page;      /* Original page on write path */
+	struct bio *bio;                /* The bio for this context */
+	struct work_struct work;        /* Work queue for read complete path */
+	struct list_head free_list;     /* Free list */
+	int flags;                      /* Flags */
+	int mode;                       /* Encryption mode for tfm */
+};
+
+struct f2fs_completion_result {
+	struct completion completion;
+	int res;
+};
+
+#define DECLARE_F2FS_COMPLETION_RESULT(ecr) \
+	struct f2fs_completion_result ecr = { \
+		COMPLETION_INITIALIZER((ecr).completion), 0 }
+
+static inline int f2fs_encryption_key_size(int mode)
+{
+	switch (mode) {
+	case F2FS_ENCRYPTION_MODE_AES_256_XTS:
+		return F2FS_AES_256_XTS_KEY_SIZE;
+	case F2FS_ENCRYPTION_MODE_AES_256_GCM:
+		return F2FS_AES_256_GCM_KEY_SIZE;
+	case F2FS_ENCRYPTION_MODE_AES_256_CBC:
+		return F2FS_AES_256_CBC_KEY_SIZE;
+	case F2FS_ENCRYPTION_MODE_AES_256_CTS:
+		return F2FS_AES_256_CTS_KEY_SIZE;
+	default:
+		BUG();
+	}
+	return 0;
+}
+
+#define F2FS_FNAME_NUM_SCATTER_ENTRIES	4
+#define F2FS_CRYPTO_BLOCK_SIZE		16
+#define F2FS_FNAME_CRYPTO_DIGEST_SIZE	32
+
+/**
+ * For encrypted symlinks, the ciphertext length is stored at the beginning
+ * of the string in little-endian format.
+ */
+struct f2fs_encrypted_symlink_data {
+	__le16 len;
+	char encrypted_path[1];
+} __attribute__((__packed__));
+
+/**
+ * This function is used to calculate the disk space required to
+ * store a filename of length l in encrypted symlink format.
+ */
+static inline u32 encrypted_symlink_data_len(u32 l)
+{
+	if (l < F2FS_CRYPTO_BLOCK_SIZE)
+		l = F2FS_CRYPTO_BLOCK_SIZE;
+	return (l + sizeof(struct f2fs_encrypted_symlink_data) - 1);
+}
+#endif	/* _F2FS_CRYPTO_H */
diff --git a/include/linux/f2fs_fs.h b/include/linux/f2fs_fs.h
index d44e97f..920408a 100644
--- a/include/linux/f2fs_fs.h
+++ b/include/linux/f2fs_fs.h
@@ -91,7 +91,9 @@  struct f2fs_super_block {
 	__u8 version[VERSION_LEN];	/* the kernel version */
 	__u8 init_version[VERSION_LEN];	/* the initial kernel version */
 	__le32 feature;			/* defined features */
-	__u8 reserved[888];		/* valid reserved region */
+	__u8 encryption_level;		/* versioning level for encryption */
+	__u8 encrypt_pw_salt[16];	/* Salt used for string2key algorithm */
+	__u8 reserved[871];		/* valid reserved region */
 } __packed;
 
 /*