
[RFC,0/5] Shared memory for shared extents

Message ID: cover.1634933121.git.rgoldwyn@suse.com

Goldwyn Rodrigues Oct. 22, 2021, 8:15 p.m. UTC
From: Goldwyn Rodrigues <rgoldwyn@suse.com>

This is an attempt to reduce the memory footprint by using a shared
page(s) for shared extent(s) in the filesystem. I am hoping to start a
discussion to iron out the details for implementation.

Abstract
If multiple files share an extent, a read from each of the files will
create an individual page copy in that inode's pagecache mapping,
wasting memory. Instead, add the pages read to the underlying device's
bd_inode mapping as opposed to the file's inode mapping. The cost is
that the file-offset-to-device-offset conversion happens early in the
read cycle.

Motivation:
 - Reduce memory usage for shared extents
 - Ease DAX implementation for shared extents
 - Reduce Container memory utilization

Implementation
In the read path, pages are read and added to the block device inode's
mapping as opposed to the file inode's mapping. This is limited to
reads; writes (and reads before writes) still go through the inode's
i_mapping. The read path checks the inode's i_mapping, which may hold
dirty pages, before falling back to the block device's i_mapping. The
cost of the operation is a file_to_device_offset() translation on
reads. The translation should return a valid value only when the file
is CoW.

This also means that page->mapping may not point to the file's mapping.
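The lookup order above can be sketched in userspace C. This is only an
illustrative model of the proposed order, not kernel code: the
structures, the caches, and the file_to_device_offset() stand-in below
are simplified inventions for the sake of the example.

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES 16

struct page { int present; int dirty; };

/* Simplified stand-ins for per-inode and per-device page caches. */
struct mapping { struct page pages[NPAGES]; };

struct inode {
    struct mapping i_mapping;   /* private pages (e.g. dirtied by writes) */
    long extent_map[NPAGES];    /* file page index -> device page index,
                                   -1 if no valid CoW/shared translation */
};

static struct mapping bdev_mapping; /* stand-in for bd_inode->i_mapping */

/* Stand-in for the proposed file_to_device_offset() translation. */
static long file_to_device_offset(struct inode *inode, long index)
{
    return inode->extent_map[index];
}

/* Read lookup order from the RFC: the inode's i_mapping first (it may
 * hold dirty pages), then fall back to the device mapping. */
static struct page *lookup_read_page(struct inode *inode, long index)
{
    struct page *p = &inode->i_mapping.pages[index];
    long dev;

    if (p->present)
        return p;               /* private (possibly dirty) copy wins */
    dev = file_to_device_offset(inode, index);
    if (dev >= 0 && bdev_mapping.pages[dev].present)
        return &bdev_mapping.pages[dev];
    return NULL;                /* would trigger actual I/O */
}
```

With two inodes whose extent maps point at the same device page, both
lookups resolve to the same struct page, which is the memory saving the
series is after; a private copy in either inode's i_mapping takes
precedence for that inode only.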

Questions:
1. Are there security implications in performing this for read-only
pages? An alternate idea is to add a "struct fspage", which would be
pointed to by the file's mapping and would in turn point to the block
device's page. Multiple files with shared extents would each have their
own independent fspage pointing to the same page in the block device's
mapping. Any security parameters, if required, would live in this
structure. The advantage of this approach is that it is more flexible
with respect to CoW when the page is dirtied after reads: with the
current approach, a read for write is an independent operation, so we
can end up with two copies of the same page. However, this
implementation got complicated too quickly.
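For illustration only, the alternate "struct fspage" indirection might
look roughly like this userspace sketch. The field and function names
are invented here, since the RFC does not define the structure:

```c
#include <assert.h>

struct page { int refcount; };

/* Hypothetical per-file indirection object: each file owns its own
 * fspage, but multiple fspages can point at the same shared page in
 * the block device's mapping. Per-file security state would live
 * here rather than in the shared page. */
struct fspage {
    struct page *page;       /* shared page in the bdev mapping */
    unsigned long f_offset;  /* offset within the owning file */
    int sec_flags;           /* placeholder for security parameters */
};

/* Toy allocator: hand out fspages from a fixed pool and pin the
 * shared page they reference. */
static struct fspage *fspage_get(struct page *shared, unsigned long off)
{
    static struct fspage pool[8];
    static int used;
    struct fspage *fsp = &pool[used++];

    fsp->page = shared;
    fsp->f_offset = off;
    shared->refcount++;      /* page freed when the last fspage goes */
    return fsp;
}
```

Each file gets an independent fspage (with its own offset and security
state) while the page itself exists only once.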

2. Should pages be dropped after writeback (and clone_range) to avoid
duplicate copies?

Limitations:
1. The filesystem must have exactly one underlying device.
2. The page size must be equal to the filesystem block size.

Goldwyn Rodrigues (5):
  mm: Use file parameter to determine bdi
  mm: Switch mapping to device mapping
  btrfs: Add sharedext mount option
  btrfs: Set s_bdev for btrfs super block
  btrfs: function to convert file offset to device offset

 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/file.c    | 42 ++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/super.c   |  7 +++++++
 include/linux/fs.h |  7 ++++++-
 mm/filemap.c       | 34 ++++++++++++++++++++++++++--------
 mm/readahead.c     |  3 +++
 6 files changed, 83 insertions(+), 11 deletions(-)

Comments

Matthew Wilcox Oct. 23, 2021, 1:43 a.m. UTC | #1
On Fri, Oct 22, 2021 at 03:15:00PM -0500, Goldwyn Rodrigues wrote:
> This is an attempt to reduce the memory footprint by using a shared
> page(s) for shared extent(s) in the filesystem. I am hoping to start a
> discussion to iron out the details for implementation.

When you say "Shared extents", you mean reflinks, which are COW, right?
A lot of what you say here confuses me because you talk about dirty
shared pages, and that doesn't make any sense.  A write fault or
call to write() should be intercepted by the filesystem in order to
break the COW.  There's no such thing as a shared page which is dirty,
or has been written back.

You might (or might not!) choose to copy the pages from the shared
extent to the inode when breaking the COW.  But you can't continue
to share them!

Goldwyn Rodrigues Oct. 25, 2021, 2:53 p.m. UTC | #2
On  2:43 23/10, Matthew Wilcox wrote:
> On Fri, Oct 22, 2021 at 03:15:00PM -0500, Goldwyn Rodrigues wrote:
> > This is an attempt to reduce the memory footprint by using a shared
> > page(s) for shared extent(s) in the filesystem. I am hoping to start a
> > discussion to iron out the details for implementation.
> 
> When you say "Shared extents", you mean reflinks, which are COW, right?

Yes, shared extents are extents which are shared on disk by two or more
files. Yes, same as reflinks. Just to explain with an example:

If two files, f1 and f2, have shared extent(s), and both files are read,
each file's mapping->i_pages will hold a copy of the contents of the
shared extent on disk. So, f1->mapping will have one copy and
f2->mapping will have another copy.

For reads (and only reads), if we use the underlying device's mapping,
we can save on duplicate copies of the pages.

> A lot of what you say here confuses me because you talk about dirty
> shared pages, and that doesn't make any sense.  A write fault or
> call to write() should be intercepted by the filesystem in order to
> break the COW.  There's no such thing as a shared page which is dirty,
> or has been written back.


I am _not_ talking about writes at all here. However, just to clear the
air: in a CoW environment, if a write happens, it becomes the
filesystem's responsibility to write it to a new device location. These
pages should still end up in the inode's i_mapping, as is done
currently. This also includes pages which are read before
modifying/writing. The filesystem will write back these pages to a new
device location. My question at the end was whether we should release
these pages, or perhaps move them to the device's mapping once they are
no longer dirty, for subsequent reads.

> 
> You might (or might not!) choose to copy the pages from the shared
> extent to the inode when breaking the COW.  But you can't continue
> to share them!

Since a write goes through the inode's i_mapping, it will go to a new
device location and will break the share. There is no change in this
behavior. What I am changing is only the "pure" reads on CoW
filesystems. IOW, there is an order in which the pagecache is looked up
for reads:
1. inode->i_mapping->i_pages
2. inode->sb->s_bdev->bd_inode->i_mapping->i_pages (after file offset
to device offset translation).

This way, reads of dirty pages will go through inode->i_mapping.
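The write and writeback behavior described above, including the open
question of migrating clean pages to the device mapping, can be modeled
with a small userspace sketch. All names and the trivial allocator here
are invented for illustration:

```c
#include <assert.h>
#include <string.h>

#define NPAGES 16

struct page { int present; int dirty; char data[8]; };
struct mapping { struct page pages[NPAGES]; };

struct inode {
    struct mapping i_mapping;
    long extent_map[NPAGES];   /* file index -> device page, -1 = none */
};

static struct mapping bdev_mapping;
static long next_free_dev = 8; /* toy allocator cursor for CoW targets */

/* Writes go through the inode's i_mapping, as today. */
static void write_page(struct inode *inode, long index, const char *buf)
{
    struct page *p = &inode->i_mapping.pages[index];

    p->present = 1;
    p->dirty = 1;
    strncpy(p->data, buf, sizeof(p->data));
}

/* CoW writeback: allocate a new device location (breaking the share),
 * clean the page, and -- per the open question in the RFC -- optionally
 * migrate the now-clean copy into the device mapping so later reads
 * can share it again. */
static void writeback_page(struct inode *inode, long index, int migrate)
{
    struct page *p = &inode->i_mapping.pages[index];
    long dev = next_free_dev++;

    inode->extent_map[index] = dev;   /* record the new device location */
    p->dirty = 0;
    if (migrate) {
        bdev_mapping.pages[dev] = *p; /* single copy, in device mapping */
        p->present = 0;
    }
}
```

With migrate enabled, the clean page exists only in the device mapping
after writeback, so a subsequent shared read finds it without a
duplicate copy in the inode's i_mapping.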

Matthew Wilcox Oct. 25, 2021, 3:43 p.m. UTC | #3
On Mon, Oct 25, 2021 at 09:53:01AM -0500, Goldwyn Rodrigues wrote:
> On  2:43 23/10, Matthew Wilcox wrote:
> > On Fri, Oct 22, 2021 at 03:15:00PM -0500, Goldwyn Rodrigues wrote:
> > > This is an attempt to reduce the memory footprint by using a shared
> > > page(s) for shared extent(s) in the filesystem. I am hoping to start a
> > > discussion to iron out the details for implementation.
> > 
> > When you say "Shared extents", you mean reflinks, which are COW, right?
> 
> Yes, shared extents are extents which are shared on disk by two or more
> files. Yes, same as reflinks. Just to explain with an example:
> 
> If two files, f1 and f2, have shared extent(s), and both files are read,
> each file's mapping->i_pages will hold a copy of the contents of the
> shared extent on disk. So, f1->mapping will have one copy and
> f2->mapping will have another copy.
> 
> For reads (and only reads), if we use the underlying device's mapping,
> we can save on duplicate copies of the pages.

Yes; I'm familiar with the problem.  Dave Chinner and I had a great
discussion about it at LCA a couple of years ago.

The implementation I've had in mind for a while is that the filesystem
either creates a separate inode for a shared extent, or (as you've
done here) uses the bdev's inode.  We can discuss the pros/cons of
that separately.

To avoid the double-lookup problem, I was intending to generalise DAX
entries into PFN entries.  That way, if the read() (or mmap read fault)
misses in the inode's cache, we can look up the shared extent cache,
and then cache the physical address of the memory in the inode.

That makes reclaim/eviction of the page in the shared extent more
expensive because you have to iterate all the inodes which share the
extent and remove the PFN entries before the page can be reused.
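The reclaim cost described here can be illustrated with a userspace
sketch: if per-inode caches hold PFN-style entries pointing at a shared
page, evicting that page must first walk back-references to every
sharer and clear its entry. The back-reference list and all names below
are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SHARERS 8

struct pfn_entry;

/* A shared page tracks which per-inode cache entries point at it. */
struct shared_page {
    struct pfn_entry *sharers[MAX_SHARERS];
    int nr_sharers;
};

/* A PFN-style cache entry: instead of holding its own page copy, the
 * inode's cache stores a pointer to the shared page. */
struct pfn_entry {
    struct shared_page *page;  /* NULL once invalidated */
};

static void pfn_attach(struct shared_page *sp, struct pfn_entry *e)
{
    e->page = sp;
    sp->sharers[sp->nr_sharers++] = e;
}

/* Eviction must iterate every sharer and clear its PFN entry before
 * the page can be reused -- this is the added reclaim cost. */
static void evict_shared_page(struct shared_page *sp)
{
    for (int i = 0; i < sp->nr_sharers; i++)
        sp->sharers[i]->page = NULL;
    sp->nr_sharers = 0;
}
```

The eviction walk is O(number of sharing inodes), which is the expense
being weighed against avoiding the double lookup on reads.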

Perhaps we should have a Zoom meeting about this before producing duelling
patch series?  I can host if you're interested.
Goldwyn Rodrigues Oct. 25, 2021, 4:43 p.m. UTC | #4
On 16:43 25/10, Matthew Wilcox wrote:
> On Mon, Oct 25, 2021 at 09:53:01AM -0500, Goldwyn Rodrigues wrote:
> > On  2:43 23/10, Matthew Wilcox wrote:
> > > On Fri, Oct 22, 2021 at 03:15:00PM -0500, Goldwyn Rodrigues wrote:
> > > > This is an attempt to reduce the memory footprint by using a shared
> > > > page(s) for shared extent(s) in the filesystem. I am hoping to start a
> > > > discussion to iron out the details for implementation.
> > > 
> > > When you say "Shared extents", you mean reflinks, which are COW, right?
> > 
> > Yes, shared extents are extents which are shared on disk by two or more
> > files. Yes, same as reflinks. Just to explain with an example:
> > 
> > If two files, f1 and f2, have shared extent(s), and both files are read,
> > each file's mapping->i_pages will hold a copy of the contents of the
> > shared extent on disk. So, f1->mapping will have one copy and
> > f2->mapping will have another copy.
> > 
> > For reads (and only reads), if we use the underlying device's mapping,
> > we can save on duplicate copies of the pages.
> 
> Yes; I'm familiar with the problem.  Dave Chinner and I had a great
> discussion about it at LCA a couple of years ago.
> 
> The implementation I've had in mind for a while is that the filesystem
> either creates a separate inode for a shared extent, or (as you've
> done here) uses the bdev's inode.  We can discuss the pros/cons of
> that separately.
> 
> To avoid the double-lookup problem, I was intending to generalise DAX
> entries into PFN entries.  That way, if the read() (or mmap read fault)
> misses in the inode's cache, we can look up the shared extent cache,
> and then cache the physical address of the memory in the inode.

I am not sure I understand. Could you provide an example? Would this be
specific to DAX? What about standard block devices?

> 
> That makes reclaim/eviction of the page in the shared extent more
> expensive because you have to iterate all the inodes which share the
> extent and remove the PFN entries before the page can be reused.

Not sure of this, but won't it complicate things if there are different
shared extents in different files? Say shared extent SE1 belongs to f1
and f2, whereas SE2 belongs to f2 and f3?

> 
> Perhaps we should have a Zoom meeting about this before producing duelling
> patch series?  I can host if you're interested.

Yes, I think that would be nice. I am in the US Central timezone.
If possible, I would like to add David Disseldorp, who is based in
Germany.