[3/3] RFC: dax: dax_prepare_freeze

Message ID	55100D10.6090902@plexistor.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-nvdimm-bounces@lists.01.org> Message-ID: <55100D10.6090902@plexistor.com> Date: Mon, 23 Mar 2015 14:54:40 +0200 From: Boaz Harrosh <boaz@plexistor.com> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0 MIME-Version: 1.0 To: Dave Chinner <david@fromorbit.com>, Matthew Wilcox <matthew.r.wilcox@intel.com>, Andrew Morton <akpm@linux-foundation.org>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Jan Kara <jack@suse.cz>, Hugh Dickins <hughd@google.com>, Mel Gorman <mgorman@suse.de>, linux-mm@kvack.org, linux-nvdimm <linux-nvdimm@ml01.01.org>, linux-fsdevel <linux-fsdevel@vger.kernel.org>, Eryu Guan <eguan@redhat.com> References: <55100B78.501@plexistor.com> In-Reply-To: <55100B78.501@plexistor.com> Subject: [Linux-nvdimm] [PATCH 3/3] RFC: dax: dax_prepare_freeze Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: linux-nvdimm-bounces@lists.01.org Sender: "Linux-nvdimm" <linux-nvdimm-bounces@lists.01.org>

Boaz Harrosh March 23, 2015, 12:54 p.m. UTC

From: Boaz Harrosh <boaz@plexistor.com>

When freezing an FS, we must write protect all IS_DAX()
inodes that have an mmap mapping on an inode. Otherwise
application will be able to modify previously faulted-in
file pages.

I'm actually doing a full unmap_mapping_range because
there is no readily available "mapping_write_protect" like
functionality. I do not think it is worth it to define one
just for here and just for some extra read-faults after an
fs_freeze.

How hot-path is fs_freeze at all?

CC: Jan Kara <jack@suse.cz>
CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
---
 fs/dax.c           | 30 ++++++++++++++++++++++++++++++
 fs/super.c         |  3 +++
 include/linux/fs.h |  1 +
 3 files changed, 34 insertions(+)

Dave Chinner March 23, 2015, 10:40 p.m. UTC | #1

On Mon, Mar 23, 2015 at 02:54:40PM +0200, Boaz Harrosh wrote:
> From: Boaz Harrosh <boaz@plexistor.com>
> 
> When freezing an FS, we must write protect all IS_DAX()
> inodes that have an mmap mapping on an inode. Otherwise
> application will be able to modify previously faulted-in
> file pages.

All you need to do is lock out page faults once the page is clean;
that's what the sb_start_pagefault() calls are for in the page fault
path - they catch write faults and block them until the filesystem
is unfrozen. Hence I'm not sure why this would be necessary if you
are catching write faults in .pfn_mkwrite....

Cheers,

Dave.

Boaz Harrosh March 24, 2015, 6:14 a.m. UTC | #2

On 03/24/2015 12:40 AM, Dave Chinner wrote:
> On Mon, Mar 23, 2015 at 02:54:40PM +0200, Boaz Harrosh wrote:
>> From: Boaz Harrosh <boaz@plexistor.com>
>>
>> When freezing an FS, we must write protect all IS_DAX()
>> inodes that have an mmap mapping on an inode. Otherwise
>> application will be able to modify previously faulted-in
>> file pages.
> 
> All you need to do is lock out page faults once the page is clean;
> that's what the sb_start_pagefault() calls are for in the page fault
> path - they catch write faults and block them until the filesystem
> is unfrozen. Hence I'm not sure why this would be necessary if you
> are catching write faults in .pfn_mkwrite....
> 

Jan pointed it out and he was right I have a test for this. What
happens is that since we had a mapping from before the freeze we will
not have a page-fault. And the buffers will be modified.

As Jan explained in the cache path we do a writeback which turns
all pages to read-only. But with dax we do not have writeback
so the pages stay read-write mapped. Something needs to loop
through the pages and write-protect them. I chose to unmap
them because it is the much-much smaller code, and I do not care
to optimize the freeze.

[Yes, sigh, I will convert the test to an xfstest. May I just add
 it to some existing fs_freeze test. Only novelty is that we need
 to write-access an mmap block before the freeze-start, then continue
 access after the freeze and see modifications
]

> Cheers,
> Dave.
> 

Thanks
Boaz

Boaz Harrosh March 24, 2015, 12:37 p.m. UTC | #3

On 03/23/2015 02:54 PM, Boaz Harrosh wrote:
> From: Boaz Harrosh <boaz@plexistor.com>
> 
> When freezing an FS, we must write protect all IS_DAX()
> inodes that have an mmap mapping on an inode. Otherwise
> application will be able to modify previously faulted-in
> file pages.
> 
> I'm actually doing a full unmap_mapping_range because
> there is no readily available "mapping_write_protect" like
> functionality. I do not think it is worth it to define one
> just for here and just for some extra read-faults after an
> fs_freeze.
> 
> How hot-path is fs_freeze at all?
> 

OK So reinspecting this was a complete raw RFC. I need to do
more work on this thing

comments below ...

> CC: Jan Kara <jack@suse.cz>
> CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> ---
>  fs/dax.c           | 30 ++++++++++++++++++++++++++++++
>  fs/super.c         |  3 +++
>  include/linux/fs.h |  1 +
>  3 files changed, 34 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index d0bd1f4..f3fc28b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -549,3 +549,33 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
>  	return dax_zero_page_range(inode, from, length, get_block);
>  }
>  EXPORT_SYMBOL_GPL(dax_truncate_page);
> +
> +/* This is meant to be called as part of freeze_super. otherwise we might
> + * Need some extra locking before calling here.
> + */
> +void dax_prepare_freeze(struct super_block *sb)
> +{
> +	struct inode *inode;
> +
> +	/* TODO: each DAX fs has some private mount option to enable DAX. If
> +	 * We made that option a generic MS_DAX_ENABLE super_block flag we could
> +	 * Avoid the 95% extra unneeded loop-on-all-inodes every freeze.
> +	 * if (!(sb->s_flags & MS_DAX_ENABLE))
> +	 *	return 0;
> +	 */
> +
> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +		/* TODO: For freezing we can actually do with write-protecting
> +		 * the page. But I cannot find a ready made function that does
> +		 * that for a giving mapping (with all the proper locking).
> +		 * How performance sensitive is the all sb_freeze API?
> +		 * For now we can just unmap the all mapping, and pay extra
> +		 * on read faults.
> +		 */
> +		/* NOTE: Do not unmap private COW mapped pages it will not
> +		 * modify the FS.
> +		 */
> +		if (IS_DAX(inode))
> +			unmap_mapping_range(inode->i_mapping, 0, 0, 0);

So what happens here is that we loop on all sb->s_inodes every freeze
and in the not DAX case just do nothing.

It could be nice to have a flag at the sb level to tel us if we need
to expect IS_DAX() inodes at all, for example when we are mounted on
an harddisk it should not be set.

All of ext2/4 and now Dave's xfs have their own
	XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX

Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it
here in Generic code? The option parsing will be done by each FS but
the flag be global?

> +	}
> +}
> diff --git a/fs/super.c b/fs/super.c
> index 2b7dc90..9ef490c 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1329,6 +1329,9 @@ int freeze_super(struct super_block *sb)
>  	/* All writers are done so after syncing there won't be dirty data */
>  	sync_filesystem(sb);
>  
> +	/* Need to take care of DAX mmaped inodes */
> +	dax_prepare_freeze(sb);
> +

So if CONFIG_FS_DAX is not set this will not compile I need to
define an empty one if not set

Cheers
Boaz


>  	/* Now wait for internal filesystem counter */
>  	sb->s_writers.frozen = SB_FREEZE_FS;
>  	smp_wmb();
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 24af817..3b943d4 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2599,6 +2599,7 @@ int dax_truncate_page(struct inode *, loff_t from, get_block_t);
>  int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
>  int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
>  #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)
> +void dax_prepare_freeze(struct super_block *sb);
>  
>  #ifdef CONFIG_BLOCK
>  typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
>

Dave Chinner March 25, 2015, 2:22 a.m. UTC | #4

On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
> On 03/24/2015 12:40 AM, Dave Chinner wrote:
> > On Mon, Mar 23, 2015 at 02:54:40PM +0200, Boaz Harrosh wrote:
> >> From: Boaz Harrosh <boaz@plexistor.com>
> >>
> >> When freezing an FS, we must write protect all IS_DAX()
> >> inodes that have an mmap mapping on an inode. Otherwise
> >> application will be able to modify previously faulted-in
> >> file pages.
> > 
> > All you need to do is lock out page faults once the page is clean;
> > that's what the sb_start_pagefault() calls are for in the page fault
> > path - they catch write faults and block them until the filesystem
> > is unfrozen. Hence I'm not sure why this would be necessary if you
> > are catching write faults in .pfn_mkwrite....
> > 
> 
> Jan pointed it out and he was right I have a test for this. What
> happens is that since we had a mapping from before the freeze we will
> not have a page-fault. And the buffers will be modified.
> 
> As Jan explained in the cache path we do a writeback which turns
> all pages to read-only. But with dax we do not have writeback
> so the pages stay read-write mapped. Something needs to loop
> through the pages and write-protect them. I chose to unmap
> them because it is the much-much smaller code, and I do not care
> to optimize the freeze.

Then we have wider problem with DAX, then: sync doesn't work
properly. i.e. if we still has write mapped pages, then we haven't
flushed dirty cache lines on write-mapped files to the persistent
domain by the time sync completes.

So, this shouldn't be some special case that only the freeze code
takes into account - we need to make sure that sync (and therefore
freeze) flushes all dirty cache lines and marks all mappings
clean....

Cheers,

Dave.

Dave Chinner March 25, 2015, 2:26 a.m. UTC | #5

On Tue, Mar 24, 2015 at 02:37:45PM +0200, Boaz Harrosh wrote:
> On 03/23/2015 02:54 PM, Boaz Harrosh wrote:
> > From: Boaz Harrosh <boaz@plexistor.com>
> > 
> > When freezing an FS, we must write protect all IS_DAX()
> > inodes that have an mmap mapping on an inode. Otherwise
> > application will be able to modify previously faulted-in
> > file pages.
> > 
> > I'm actually doing a full unmap_mapping_range because
> > there is no readily available "mapping_write_protect" like
> > functionality. I do not think it is worth it to define one
> > just for here and just for some extra read-faults after an
> > fs_freeze.
> > 
> > How hot-path is fs_freeze at all?
> > 
> 
> OK So reinspecting this was a complete raw RFC. I need to do
> more work on this thing
> 
> comments below ...
> 
> > CC: Jan Kara <jack@suse.cz>
> > CC: Matthew Wilcox <matthew.r.wilcox@intel.com>
> > CC: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
> > ---
> >  fs/dax.c           | 30 ++++++++++++++++++++++++++++++
> >  fs/super.c         |  3 +++
> >  include/linux/fs.h |  1 +
> >  3 files changed, 34 insertions(+)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index d0bd1f4..f3fc28b 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -549,3 +549,33 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
> >  	return dax_zero_page_range(inode, from, length, get_block);
> >  }
> >  EXPORT_SYMBOL_GPL(dax_truncate_page);
> > +
> > +/* This is meant to be called as part of freeze_super. otherwise we might
> > + * Need some extra locking before calling here.
> > + */
> > +void dax_prepare_freeze(struct super_block *sb)
> > +{
> > +	struct inode *inode;
> > +
> > +	/* TODO: each DAX fs has some private mount option to enable DAX. If
> > +	 * We made that option a generic MS_DAX_ENABLE super_block flag we could
> > +	 * Avoid the 95% extra unneeded loop-on-all-inodes every freeze.
> > +	 * if (!(sb->s_flags & MS_DAX_ENABLE))
> > +	 *	return 0;
> > +	 */
> > +
> > +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {

missing locking.

> > +		/* TODO: For freezing we can actually do with write-protecting
> > +		 * the page. But I cannot find a ready made function that does
> > +		 * that for a giving mapping (with all the proper locking).
> > +		 * How performance sensitive is the all sb_freeze API?
> > +		 * For now we can just unmap the all mapping, and pay extra
> > +		 * on read faults.
> > +		 */
> > +		/* NOTE: Do not unmap private COW mapped pages it will not
> > +		 * modify the FS.
> > +		 */
> > +		if (IS_DAX(inode))
> > +			unmap_mapping_range(inode->i_mapping, 0, 0, 0);
> 
> So what happens here is that we loop on all sb->s_inodes every freeze
> and in the not DAX case just do nothing.

Which is real bad and known to be a performance issue. See Josef's
recent sync scalability patchset posting that only tracks and walks
dirty inodes...

> It could be nice to have a flag at the sb level to tel us if we need
> to expect IS_DAX() inodes at all, for example when we are mounted on
> an harddisk it should not be set.
> 
> All of ext2/4 and now Dave's xfs have their own
> 	XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX
> 
> Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it
> here in Generic code? The option parsing will be done by each FS but
> the flag be global?

No, because as I mentioned in another thread we're going to end up
with filesystems that don't have "mount wide" DAX behaviour, and we
have to check every dirty inode anyway. And....

> > diff --git a/fs/super.c b/fs/super.c
> > index 2b7dc90..9ef490c 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -1329,6 +1329,9 @@ int freeze_super(struct super_block *sb)
> >  	/* All writers are done so after syncing there won't be dirty data */
> >  	sync_filesystem(sb);
> >  
> > +	/* Need to take care of DAX mmaped inodes */
> > +	dax_prepare_freeze(sb);
> > +
> 
> So if CONFIG_FS_DAX is not set this will not compile I need to
> define an empty one if not set

... it's the wrong approach - sync_filesystem(sb) shoul dbe handling
this problem, so that sync and fsync work correctly, and then you
don't care about whether DAX is supported or not...

Cheers,

Dave.

Boaz Harrosh March 25, 2015, 8:10 a.m. UTC | #6

On 03/25/2015 04:22 AM, Dave Chinner wrote:
> On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
<>
> 
> Then we have wider problem with DAX, then: sync doesn't work
> properly. i.e. if we still has write mapped pages, then we haven't
> flushed dirty cache lines on write-mapped files to the persistent
> domain by the time sync completes.
> 
> So, this shouldn't be some special case that only the freeze code
> takes into account - we need to make sure that sync (and therefore
> freeze) flushes all dirty cache lines and marks all mappings
> clean....
> 

This is not how I understood it and how I read the code.

The sync does happen, .fsync of the FS is called on each
file just as if the user called it. If this is broken it just
needs to be fixed there at the .fsync vector. POSIX mandate
persistence at .fsync so at the vfs layer we rely on that.

So everything at this stage should be synced to real media.

What does not happen is writeback. since dax does not have
any writeback. And because of that nothing turned the
user mappings to read only. This is what I do here but
instead of write-protecting I just unmap because it is
easier for me to code it.

> Cheers,
> Dave.

Cheers
Boaz

Boaz Harrosh March 25, 2015, 8:31 a.m. UTC | #7

On 03/25/2015 04:26 AM, Dave Chinner wrote:
<>
>>> +	/* TODO: each DAX fs has some private mount option to enable DAX. If
>>> +	 * We made that option a generic MS_DAX_ENABLE super_block flag we could
>>> +	 * Avoid the 95% extra unneeded loop-on-all-inodes every freeze.
>>> +	 * if (!(sb->s_flags & MS_DAX_ENABLE))
>>> +	 *	return 0;
>>> +	 */
>>> +
>>> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> 
> missing locking.
> 

I will please need help here. This is very deep inside the freeze process
we area already holding bunch of locks. We know that nothing can be modified
at this stage. We are completely read-only.

Only thing I can see that can happen is inode eviction do to oom. So
do I need an iget. But how do I know the iget is allowed here?

OK I do not have a clue what locks do I need, this deep in the freeze?

>>> +		/* TODO: For freezing we can actually do with write-protecting
>>> +		 * the page. But I cannot find a ready made function that does
>>> +		 * that for a giving mapping (with all the proper locking).
>>> +		 * How performance sensitive is the all sb_freeze API?
>>> +		 * For now we can just unmap the all mapping, and pay extra
>>> +		 * on read faults.
>>> +		 */
>>> +		/* NOTE: Do not unmap private COW mapped pages it will not
>>> +		 * modify the FS.
>>> +		 */
>>> +		if (IS_DAX(inode))
>>> +			unmap_mapping_range(inode->i_mapping, 0, 0, 0);
>>
>> So what happens here is that we loop on all sb->s_inodes every freeze
>> and in the not DAX case just do nothing.
> 
> Which is real bad and known to be a performance issue. See Josef's
> recent sync scalability patchset posting that only tracks and walks
> dirty inodes...
> 

Sure but how hot is freeze? Josef's fixed the very hot sync path,
but freeze happens once in a blue moon. Do we care?

>> It could be nice to have a flag at the sb level to tel us if we need
>> to expect IS_DAX() inodes at all, for example when we are mounted on
>> an harddisk it should not be set.
>>
>> All of ext2/4 and now Dave's xfs have their own
>> 	XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX
>>
>> Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it
>> here in Generic code? The option parsing will be done by each FS but
>> the flag be global?
> 
> No, because as I mentioned in another thread we're going to end up
> with filesystems that don't have "mount wide" DAX behaviour, and we
> have to check every dirty inode anyway. And....
> 

Sure! but let us contract with the FS, that please set the MS_MOUNT_DAX
if there is any chance at all that IS_DAX() comes out true, so we loop
here. 

OK You know what, I will change this check to be:
	if (sb->s_bdev->bd_disk->fops->direct_access)

BTW: We must loop this way on every sb inode because we do not have
dirty inodes. There is no "dirty"ing going on in dax, not of inodes
and not of pages.

>>> diff --git a/fs/super.c b/fs/super.c
>>> index 2b7dc90..9ef490c 100644
>>> --- a/fs/super.c
>>> +++ b/fs/super.c
>>> @@ -1329,6 +1329,9 @@ int freeze_super(struct super_block *sb)
>>>  	/* All writers are done so after syncing there won't be dirty data */
>>>  	sync_filesystem(sb);
>>>  
>>> +	/* Need to take care of DAX mmaped inodes */
>>> +	dax_prepare_freeze(sb);
>>> +
>>
>> So if CONFIG_FS_DAX is not set this will not compile I need to
>> define an empty one if not set
> 
> ... it's the wrong approach - sync_filesystem(sb) shoul dbe handling
> this problem, so that sync and fsync work correctly, and then you
> don't care about whether DAX is supported or not...
> 

sync and fsync should and will work correctly, but this does not
solve our problem. because what turns pages to read-only is the
writeback. And we do not have this in dax. Therefore we need to
do this here as a special case.

> Cheers,
> Dave.
> 

I have a new patchset with all this, I will send it once it is fully
tested, I have problems testing both freeze and splice there are not
any good tests that I could find that do what I want, so still working.

Thanks
Boaz

Dave Chinner March 25, 2015, 9:29 a.m. UTC | #8

On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote:
> On 03/25/2015 04:22 AM, Dave Chinner wrote:
> > On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
> <>
> > 
> > Then we have wider problem with DAX, then: sync doesn't work
> > properly. i.e. if we still has write mapped pages, then we haven't
> > flushed dirty cache lines on write-mapped files to the persistent
> > domain by the time sync completes.
> > 
> > So, this shouldn't be some special case that only the freeze code
> > takes into account - we need to make sure that sync (and therefore
> > freeze) flushes all dirty cache lines and marks all mappings
> > clean....
> > 
> 
> This is not how I understood it and how I read the code.
> 
> The sync does happen, .fsync of the FS is called on each
> file just as if the user called it. If this is broken it just
> needs to be fixed there at the .fsync vector. POSIX mandate
> persistence at .fsync so at the vfs layer we rely on that.

right now, the filesystems will see that there are no dirty pages
on the inode, and then just sync the inode metadata. They will do
nothing else as filesystems are not aware of CPU cachelines at all.

> So everything at this stage should be synced to real media.

Actually no. This is what intel are introducing new CPU instructions
for - so fsync can flush the cpu caches and commit them to th
persistence domain correctly.

> What does not happen is writeback. since dax does not have
> any writeback.

Which is precisely the problem we need to address - we don't need
writeback to a block device, but we do need the dirty CPU cachelines
flushed and the mappings cleaned.

> And because of that nothing turned the
> user mappings to read only. This is what I do here but
> instead of write-protecting I just unmap because it is
> easier for me to code it.

That doesn't mean it is the correct solution.

Cheers,

Dave.

Dave Chinner March 25, 2015, 9:41 a.m. UTC | #9

On Wed, Mar 25, 2015 at 10:31:22AM +0200, Boaz Harrosh wrote:
> On 03/25/2015 04:26 AM, Dave Chinner wrote:
> <>
> >>> +	/* TODO: each DAX fs has some private mount option to enable DAX. If
> >>> +	 * We made that option a generic MS_DAX_ENABLE super_block flag we could
> >>> +	 * Avoid the 95% extra unneeded loop-on-all-inodes every freeze.
> >>> +	 * if (!(sb->s_flags & MS_DAX_ENABLE))
> >>> +	 *	return 0;
> >>> +	 */
> >>> +
> >>> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > 
> > missing locking.
> > 
> 
> I will please need help here. This is very deep inside the freeze process
> we area already holding bunch of locks. We know that nothing can be modified
> at this stage. We are completely read-only.

Which means we could stillbe reading new inodes in off disk and
hence the sb->s_inodes list can be changing. Memory reclaim can be
running via the shrinker, freeing clean inodes, hence the
sb->s_inodes list can be changing.

From fs/inode.c:

/*
 * Inode locking rules:
.....
 * inode_sb_list_lock protects:
 *   sb->s_inodes, inode->i_sb_list

This...

> >>> +		/* TODO: For freezing we can actually do with write-protecting
> >>> +		 * the page. But I cannot find a ready made function that does
> >>> +		 * that for a giving mapping (with all the proper locking).
> >>> +		 * How performance sensitive is the all sb_freeze API?
> >>> +		 * For now we can just unmap the all mapping, and pay extra
> >>> +		 * on read faults.
> >>> +		 */
> >>> +		/* NOTE: Do not unmap private COW mapped pages it will not
> >>> +		 * modify the FS.
> >>> +		 */
> >>> +		if (IS_DAX(inode))
> >>> +			unmap_mapping_range(inode->i_mapping, 0, 0, 0);
> >>
> >> So what happens here is that we loop on all sb->s_inodes every freeze
> >> and in the not DAX case just do nothing.
> > 
> > Which is real bad and known to be a performance issue. See Josef's
> > recent sync scalability patchset posting that only tracks and walks
> > dirty inodes...
> 
> Sure but how hot is freeze? Josef's fixed the very hot sync path,
> but freeze happens once in a blue moon. Do we care?

Yes, because if you have 50 million cached inodes on a filesystem,
it's going to take a long time to traverse them all, and right now
the inode_sb_list_lock is a *global lock*.

> >> It could be nice to have a flag at the sb level to tel us if we need
> >> to expect IS_DAX() inodes at all, for example when we are mounted on
> >> an harddisk it should not be set.
> >>
> >> All of ext2/4 and now Dave's xfs have their own
> >> 	XFS_MOUNT_DAX / EXT2_MOUNT_DAX / EXT4_MOUNT_DAX
> >>
> >> Is it OK if I unify all this on sb->s_flags |= MS_MOUNT_DAX so I can check it
> >> here in Generic code? The option parsing will be done by each FS but
> >> the flag be global?
> > 
> > No, because as I mentioned in another thread we're going to end up
> > with filesystems that don't have "mount wide" DAX behaviour, and we
> > have to check every dirty inode anyway. And....
> > 
> 
> Sure! but let us contract with the FS, that please set the MS_MOUNT_DAX
> if there is any chance at all that IS_DAX() comes out true, so we loop
> here. 

The mount option is irrelevant here - we should only be looping over
dirty inodes.  We don't care if they are DAX or not - we have to
iterate them and ensure they are properly clean. We already have
infrastructure to do this - we should use it and fix the problem
once and for all rather than hacking special case code into random
places.

> BTW: We must loop this way on every sb inode because we do not have
> dirty inodes. There is no "dirty"ing going on in dax, not of inodes
> and not of pages.

Precisely the problem we need to address. We do have dirty inodes,
we just never set the fact they have dirty "pages" on them and hence
never do data writeback on them.

> > ... it's the wrong approach - sync_filesystem(sb) shoul dbe handling
> > this problem, so that sync and fsync work correctly, and then you
> > don't care about whether DAX is supported or not...
> > 
> 
> sync and fsync should and will work correctly, but this does not
> solve our problem. because what turns pages to read-only is the
> writeback. And we do not have this in dax. Therefore we need to
> do this here as a special case.

We can still use exactly the same dirty tracking as we use for data
writeback. The difference is that we don't need to go through all
teh page writeback; we can just flush the CPU caches and mark all
the mappings clean, then clear the I_DIRTY_PAGES flag and move on to
inode writeback....

Cheers,

Dave.

Boaz Harrosh March 25, 2015, 10:19 a.m. UTC | #10

On 03/25/2015 11:29 AM, Dave Chinner wrote:
> On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote:
>> On 03/25/2015 04:22 AM, Dave Chinner wrote:
>>> On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
>> <>
<>
>> The sync does happen, .fsync of the FS is called on each
>> file just as if the user called it. If this is broken it just
>> needs to be fixed there at the .fsync vector. POSIX mandate
>> persistence at .fsync so at the vfs layer we rely on that.
> 
> right now, the filesystems will see that there are no dirty pages
> on the inode, and then just sync the inode metadata. They will do
> nothing else as filesystems are not aware of CPU cachelines at all.
> 

Sigh yes. There is this bug. And I am sitting on a wide fix for this.

The strategy is. All Kernel writes are done with a new copy_user_nt.
NT stands for none-temporal. This shows 20% improvements since cachelines
need not be fetched when written too.

The arches that do not have NT instructions, will use a generic
copy_user_nt that does a copy_user and then flush cashes.
Same flush cashes we do before DMA IO. (effectively every 4k)
[Its more complicated with the edges and all, by I have solved
 all this. Will post in a week or two]

So what is left is the mmaped inodes. The logic here is that
at .fsync vector dax inodes will do a cl_flush only if mapping_mapped()
is true. Also .msync is the same as .fsync

And one last thing we also call .fsync at vm_operations_struct->close
because it is allowed for an app to do mmap, munmap, .fsync so we just
call dax .fsync at munmap always.

So by now we should be covered for fsync guaranty.

>> So everything at this stage should be synced to real media.
> 
> Actually no. This is what intel are introducing new CPU instructions
> for - so fsync can flush the cpu caches and commit them to th
> persistence domain correctly.
> 

The new intel instructions are for an optimization, and they will
fit in the picture for the CPUs that have it. But there are already
NT instructions for existing CPUs. (Just not as fast and precise)

Every ARCH will do its best under a small API
	copy_user_nt	- data is at media
	memset_nt	- data is at media
	cl_flush	- partial written cachelines flushed to media 
	sfence		- New data seen by all CPUs

>> What does not happen is writeback. since dax does not have
>> any writeback.
> 
> Which is precisely the problem we need to address - we don't need
> writeback to a block device, but we do need the dirty CPU cachelines
> flushed and the mappings cleaned.
> 

I see what you mean. Since nothing dirtied the inode then above
.fsync will not be called and we have not pushed mmap data to
media.

Again here we only need to do this for mmaped inodes, because
Kernel written data is (will be) written NT style.

>> And because of that nothing turned the
>> user mappings to read only. This is what I do here but
>> instead of write-protecting I just unmap because it is
>> easier for me to code it.
> 
> That doesn't mean it is the correct solution.

Please note that even if we properly .fsync cachlines the page-faults
are orthogonal to this. There is no point in making mmapped dax pages
read-only after every .fsync and pay a page-fault. We should leave them
mapped has is. The only place that we need page protection is at freeze
time.

But I see that we might have a problem with .fsync not being called.
I see that you sent a second mail. I'll try to answer there.

> Cheers,
> Dave.

Thanks
Boaz

Boaz Harrosh March 25, 2015, 10:40 a.m. UTC | #11

On 03/25/2015 11:41 AM, Dave Chinner wrote:
> On Wed, Mar 25, 2015 at 10:31:22AM +0200, Boaz Harrosh wrote:
>> On 03/25/2015 04:26 AM, Dave Chinner wrote:
<>
>> sync and fsync should and will work correctly, but this does not
>> solve our problem. because what turns pages to read-only is the
>> writeback. And we do not have this in dax. Therefore we need to
>> do this here as a special case.
> 
> We can still use exactly the same dirty tracking as we use for data
> writeback. The difference is that we don't need to go through all
> teh page writeback; we can just flush the CPU caches and mark all
> the mappings clean, then clear the I_DIRTY_PAGES flag and move on to
> inode writeback....
> 

I see what you mean. the sb wide sync will not step into mmaped inodes
and fsync them.

If we go my way and write NT (None Temporal) style in Kernel.
NT instructions exist since xeon and all the Intel iX core CPUs have
them. In tests we conducted doing xeon NT-writes vs
regular-writes-and-cl_flush at .fsync showed minimum of 20% improvement.
That is on very large IOs. On 4k IOs it was even better.

It looks like you have a much better picture in your mind how to
fit this properly at the inode-dirty picture. Can you attempt a rough draft?

If we are going the NT way. Then we can only I_DIRTY_ track the mmaped
inodes. For me this is really scary because I do not want to trigger
any writeback threads. If you could please draw me an outline (or write
something up ;-)) it would be great.

> Cheers,
> Dave.

Thanks
Boaz

Dave Chinner March 25, 2015, 8 p.m. UTC | #12

On Wed, Mar 25, 2015 at 12:19:50PM +0200, Boaz Harrosh wrote:
> On 03/25/2015 11:29 AM, Dave Chinner wrote:
> > On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote:
> >> On 03/25/2015 04:22 AM, Dave Chinner wrote:
> >>> On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
> >> <>
> <>
> >> The sync does happen, .fsync of the FS is called on each
> >> file just as if the user called it. If this is broken it just
> >> needs to be fixed there at the .fsync vector. POSIX mandate
> >> persistence at .fsync so at the vfs layer we rely on that.
> > 
> > right now, the filesystems will see that there are no dirty pages
> > on the inode, and then just sync the inode metadata. They will do
> > nothing else as filesystems are not aware of CPU cachelines at all.
> > 
> 
> Sigh yes. There is this bug. And I am sitting on a wide fix for this.
> 
> The strategy is. All Kernel writes are done with a new copy_user_nt.
> NT stands for none-temporal. This shows 20% improvements since cachelines
> need not be fetched when written too.

That's unenforcable for mmap writes from userspace. And those are
the writes that will trigger the dirty write mapping problem.

> >> And because of that nothing turned the
> >> user mappings to read only. This is what I do here but
> >> instead of write-protecting I just unmap because it is
> >> easier for me to code it.
> > 
> > That doesn't mean it is the correct solution.
> 
> Please note that even if we properly .fsync cachlines the page-faults
> are orthogonal to this. There is no point in making mmapped dax pages
> read-only after every .fsync and pay a page-fault. We should leave them
> mapped has is. The only place that we need page protection is at freeze
> time.

Actually, current behaviour of filesystems is that fsync cleans all
the pages in the range, and means all the mappings are marked
read-only and so we get new calls into .page_mkwrite when write
faults occur. We need that .page_mkwrite call to be able to a)
update the mtime of the inode, and b) mark the inode "data dirty" so
that fsync knows it needs to do something....

Hence I'd much prefer we start with identical behaviour to normal
files, then we can optimise from a sane start point when write page
faults show up as a performance problem. i.e. Correctness first,
performance second.

Cheers,

Dave.

Dave Chinner March 25, 2015, 8:05 p.m. UTC | #13

On Wed, Mar 25, 2015 at 12:40:44PM +0200, Boaz Harrosh wrote:
> On 03/25/2015 11:41 AM, Dave Chinner wrote:
> > On Wed, Mar 25, 2015 at 10:31:22AM +0200, Boaz Harrosh wrote:
> >> On 03/25/2015 04:26 AM, Dave Chinner wrote:
> <>
> >> sync and fsync should and will work correctly, but this does not
> >> solve our problem. because what turns pages to read-only is the
> >> writeback. And we do not have this in dax. Therefore we need to
> >> do this here as a special case.
> > 
> > We can still use exactly the same dirty tracking as we use for data
> > writeback. The difference is that we don't need to go through all
> > teh page writeback; we can just flush the CPU caches and mark all
> > the mappings clean, then clear the I_DIRTY_PAGES flag and move on to
> > inode writeback....
> > 
> 
> I see what you mean. the sb wide sync will not step into mmaped inodes
> and fsync them.
> 
> If we go my way and write NT (None Temporal) style in Kernel.
> NT instructions exist since xeon and all the Intel iX core CPUs have
> them. In tests we conducted doing xeon NT-writes vs
> regular-writes-and-cl_flush at .fsync showed minimum of 20% improvement.
> That is on very large IOs. On 4k IOs it was even better.

As I said before, relying on specific instructions is a non-starter
for mmap writes, and that's the problem we need to solve here.

> It looks like you have a much better picture in your mind how to
> fit this properly at the inode-dirty picture. Can you attempt a rough draft?
> 
> If we are going the NT way. Then we can only I_DIRTY_ track the mmaped
> inodes. For me this is really scary because I do not want to trigger
> any writeback threads. If you could please draw me an outline (or write
> something up ;-)) it would be great.

Writeback threads are not used for fsync - they are used for sync
and background writeback. They are already active on DAX filesystems
that track dirty inodes on the VFS superblock, as this is the way
inodes are written back on some filesystems.

Cheers,

Dave.

Boaz Harrosh March 26, 2015, 8:02 a.m. UTC | #14

On 03/25/2015 10:00 PM, Dave Chinner wrote:
> On Wed, Mar 25, 2015 at 12:19:50PM +0200, Boaz Harrosh wrote:
>> On 03/25/2015 11:29 AM, Dave Chinner wrote:
>>> On Wed, Mar 25, 2015 at 10:10:31AM +0200, Boaz Harrosh wrote:
>>>> On 03/25/2015 04:22 AM, Dave Chinner wrote:
>>>>> On Tue, Mar 24, 2015 at 08:14:59AM +0200, Boaz Harrosh wrote:
>>>> <>
>> <>
>>>> The sync does happen, .fsync of the FS is called on each
>>>> file just as if the user called it. If this is broken it just
>>>> needs to be fixed there at the .fsync vector. POSIX mandate
>>>> persistence at .fsync so at the vfs layer we rely on that.
>>>
>>> right now, the filesystems will see that there are no dirty pages
>>> on the inode, and then just sync the inode metadata. They will do
>>> nothing else as filesystems are not aware of CPU cachelines at all.
>>>
>>
>> Sigh yes. There is this bug. And I am sitting on a wide fix for this.
>>
>> The strategy is. All Kernel writes are done with a new copy_user_nt.
>> NT stands for none-temporal. This shows 20% improvements since cachelines
>> need not be fetched when written too.
> 
> That's unenforcable for mmap writes from userspace. And those are
> the writes that will trigger the dirty write mapping problem.
> 

So for them I was thinking of just doing the .fsync on every
unmap (ie vm_operations_struct->close)

So now we know that only inodes that have an active vm mapping
are in need of sync.

>>>> And because of that nothing turned the
>>>> user mappings to read only. This is what I do here but
>>>> instead of write-protecting I just unmap because it is
>>>> easier for me to code it.
>>>
>>> That doesn't mean it is the correct solution.
>>
>> Please note that even if we properly .fsync cachlines the page-faults
>> are orthogonal to this. There is no point in making mmapped dax pages
>> read-only after every .fsync and pay a page-fault. We should leave them
>> mapped has is. The only place that we need page protection is at freeze
>> time.
> 
> Actually, current behaviour of filesystems is that fsync cleans all
> the pages in the range, and means all the mappings are marked
> read-only and so we get new calls into .page_mkwrite when write
> faults occur. We need that .page_mkwrite call to be able to a)
> update the mtime of the inode, and b) mark the inode "data dirty" so
> that fsync knows it needs to do something....
> 
> Hence I'd much prefer we start with identical behaviour to normal
> files, then we can optimise from a sane start point when write page
> faults show up as a performance problem. i.e. Correctness first,
> performance second.
> 

OK, (you see when you speak slow I understand fast ;-)). I agree then
I'll see if I can schedule some time for this. My boss will be very
angry with me about this. But I will need help please, and some hands
holding. Unless someone else volunteers to work on this ?

> Cheers,
> Dave.
> 

Thanks
Boaz

[3/3] RFC: dax: dax_prepare_freeze

Commit Message

Comments

Patch