diff mbox series

[RFC,v2,5/8] ovl: mark overlayfs' inode dirty on shared writable mmap

Message ID 20201025034117.4918-6-cgxu519@mykernel.net (mailing list archive)
State New
Series implement containerized syncfs for overlayfs

Commit Message

Chengguang Xu Oct. 25, 2020, 3:41 a.m. UTC
Overlayfs is not notified when an mmapped area becomes dirty,
so we need to proactively mark the inode dirty in the ->mmap operation.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
---
 fs/overlayfs/file.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Jan Kara Nov. 2, 2020, 5:30 p.m. UTC | #1
On Sun 25-10-20 11:41:14, Chengguang Xu wrote:
> Overlayfs cannot be notified when mmapped area gets dirty,
> so we need to proactively mark inode dirty in ->mmap operation.
> 
> Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
> ---
>  fs/overlayfs/file.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index efccb7c1f9bc..cd6fcdfd81a9 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
>  		/* Drop reference count from new vm_file value */
>  		fput(realfile);
>  	} else {
> +		if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
> +		    vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
> +			ovl_mark_inode_dirty(file_inode(file));
> +

But does this work reliably? I mean once writeback runs, your inode (as
well as upper inode) is cleaned. Then a page fault comes so file has dirty
pages again and would need flushing but overlayfs inode stays clean? Am I
missing something?

								Honza
Chengguang Xu Nov. 4, 2020, 11:54 a.m. UTC | #2
---- On Tue, 2020-11-03 01:30:52 Jan Kara <jack@suse.cz> wrote ----
 > On Sun 25-10-20 11:41:14, Chengguang Xu wrote:
 > > Overlayfs cannot be notified when mmapped area gets dirty,
 > > so we need to proactively mark inode dirty in ->mmap operation.
 > > 
 > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
 > > ---
 > >  fs/overlayfs/file.c | 4 ++++
 > >  1 file changed, 4 insertions(+)
 > > 
 > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
 > > index efccb7c1f9bc..cd6fcdfd81a9 100644
 > > --- a/fs/overlayfs/file.c
 > > +++ b/fs/overlayfs/file.c
 > > @@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
 > >          /* Drop reference count from new vm_file value */
 > >          fput(realfile);
 > >      } else {
 > > +        if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
 > > +            vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
 > > +            ovl_mark_inode_dirty(file_inode(file));
 > > +
 > 
 > But does this work reliably? I mean once writeback runs, your inode (as
 > well as upper inode) is cleaned. Then a page fault comes so file has dirty
 > pages again and would need flushing but overlayfs inode stays clean? Am I
 > missing something?
 > 

Yes, this is the key point of the approach. To fix the issue, I explicitly set
the I_DIRTY_SYNC flag in ovl_mark_inode_dirty(), so that writeback calls into
->write_inode() because of this flag (I_DIRTY_SYNC). There we get a chance to
check the mapping and re-dirty overlayfs' inode. The logic in ovl_write_inode()
looks like this:

    if (mapping_writably_mapped(upper->i_mapping) ||
         mapping_tagged(upper->i_mapping, PAGECACHE_TAG_WRITEBACK))
                 iflag |= I_DIRTY_PAGES; 




Thanks,
Chengguang
Jan Kara Nov. 5, 2020, 2:03 p.m. UTC | #3
On Wed 04-11-20 19:54:03, Chengguang Xu wrote:
>  ---- On Tue, 2020-11-03 01:30:52 Jan Kara <jack@suse.cz> wrote ----
>  > On Sun 25-10-20 11:41:14, Chengguang Xu wrote:
>  > > Overlayfs cannot be notified when mmapped area gets dirty,
>  > > so we need to proactively mark inode dirty in ->mmap operation.
>  > > 
>  > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
>  > > ---
>  > >  fs/overlayfs/file.c | 4 ++++
>  > >  1 file changed, 4 insertions(+)
>  > > 
>  > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
>  > > index efccb7c1f9bc..cd6fcdfd81a9 100644
>  > > --- a/fs/overlayfs/file.c
>  > > +++ b/fs/overlayfs/file.c
>  > > @@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
>  > >          /* Drop reference count from new vm_file value */
>  > >          fput(realfile);
>  > >      } else {
>  > > +        if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
>  > > +            vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
>  > > +            ovl_mark_inode_dirty(file_inode(file));
>  > > +
>  > 
>  > But does this work reliably? I mean once writeback runs, your inode (as
>  > well as upper inode) is cleaned. Then a page fault comes so file has dirty
>  > pages again and would need flushing but overlayfs inode stays clean? Am I
>  > missing something?
>  > 
> 
> Yeah, this is key point of this approach, in order to  fix the issue I
> explicitly set I_DIRTY_SYNC flag in ovl_mark_inode_dirty(), so what i
> mean is during writeback we will call into ->write_inode() by this
> flag(I_DIRTY_SYNC) and at that place we get chance to check mapping and
> re-dirty overlay's inode. The code logic like below in ovl_write_inode().
> 
>     if (mapping_writably_mapped(upper->i_mapping) ||
>          mapping_tagged(upper->i_mapping, PAGECACHE_TAG_WRITEBACK))
>                  iflag |= I_DIRTY_PAGES; 

OK, but suppose the upper mapping is clean at this moment (upper inode has
been fully written out for whatever reason, but it is still mapped) so your
overlayfs inode becomes clean as well. Then I don't see a mechanism which
would make your overlayfs inode dirty again when a write to mmap happens;
set_page_dirty() will end up marking the upper inode with I_DIRTY_PAGES flag.

Note that ovl_mmap() gets called only at mmap(2) syscall time but then
pages get faulted in, dirtied, cleaned fully at discretion of the mm
/ writeback subsystem.
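
The sequence described above can be modelled in a few lines of userspace C. This is only a toy sketch: the struct and function names are hypothetical, it tracks nothing but a dirty bit per inode, and it assumes writeback cleans both inodes without re-dirtying the overlayfs one.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: only the dirty state of the two inodes is tracked. */
struct toy_inode {
	bool dirty;
};

struct toy_ovl {
	struct toy_inode ovl;   /* overlayfs inode */
	struct toy_inode upper; /* real inode on the upper layer */
};

/* mmap(2) time: the patch marks the overlayfs inode dirty, once. */
static void toy_mmap_shared_writable(struct toy_ovl *o)
{
	o->ovl.dirty = true;
}

/* Writeback pass that does not re-dirty: both inodes end up clean. */
static void toy_writeback(struct toy_ovl *o)
{
	o->ovl.dirty = false;
	o->upper.dirty = false;
}

/* Store through the mapping: only the upper inode gets dirtied. */
static void toy_store_via_mmap(struct toy_ovl *o)
{
	o->upper.dirty = true;
}

/* A sync that walks only dirty overlayfs inodes would skip this file. */
static bool toy_sync_would_miss(const struct toy_ovl *o)
{
	return o->upper.dirty && !o->ovl.dirty;
}
```

Calling toy_mmap_shared_writable(), toy_writeback() and toy_store_via_mmap() in that order leaves the model in the "clean overlayfs inode, dirty upper inode" state.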

								Honza
Amir Goldstein Nov. 5, 2020, 2:21 p.m. UTC | #4
On Thu, Nov 5, 2020 at 4:03 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 04-11-20 19:54:03, Chengguang Xu wrote:
> >  ---- On Tue, 2020-11-03 01:30:52 Jan Kara <jack@suse.cz> wrote ----
> >  > On Sun 25-10-20 11:41:14, Chengguang Xu wrote:
> >  > > Overlayfs cannot be notified when mmapped area gets dirty,
> >  > > so we need to proactively mark inode dirty in ->mmap operation.
> >  > >
> >  > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
> >  > > ---
> >  > >  fs/overlayfs/file.c | 4 ++++
> >  > >  1 file changed, 4 insertions(+)
> >  > >
> >  > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> >  > > index efccb7c1f9bc..cd6fcdfd81a9 100644
> >  > > --- a/fs/overlayfs/file.c
> >  > > +++ b/fs/overlayfs/file.c
> >  > > @@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
> >  > >          /* Drop reference count from new vm_file value */
> >  > >          fput(realfile);
> >  > >      } else {
> >  > > +        if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
> >  > > +            vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
> >  > > +            ovl_mark_inode_dirty(file_inode(file));
> >  > > +
> >  >
> >  > But does this work reliably? I mean once writeback runs, your inode (as
> >  > well as upper inode) is cleaned. Then a page fault comes so file has dirty
> >  > pages again and would need flushing but overlayfs inode stays clean? Am I
> >  > missing something?
> >  >
> >
> > Yeah, this is key point of this approach, in order to  fix the issue I
> > explicitly set I_DIRTY_SYNC flag in ovl_mark_inode_dirty(), so what i
> > mean is during writeback we will call into ->write_inode() by this
> > flag(I_DIRTY_SYNC) and at that place we get chance to check mapping and
> > re-dirty overlay's inode. The code logic like below in ovl_write_inode().
> >
> >     if (mapping_writably_mapped(upper->i_mapping) ||
> >          mapping_tagged(upper->i_mapping, PAGECACHE_TAG_WRITEBACK))
> >                  iflag |= I_DIRTY_PAGES;
>
> OK, but suppose the upper mapping is clean at this moment (upper inode has
> been fully written out for whatever reason, but it is still mapped) so your
> overlayfs inode becomes clean as well. Then I don't see a mechanism which
> would make your overlayfs inode dirty again when a write to mmap happens,
> set_page_dirty() will end up marking upper inode with I_DIRTY_PAGES flag.
>
> Note that ovl_mmap() gets called only at mmap(2) syscall time but then
> pages get faulted in, dirtied, cleaned fully at discretion of the mm
> / writeback subsystem.
>

Perhaps I will add some background.

What I suggested was to maintain a "suspect list" in addition to
the dirty ovl inodes.

An ovl inode is added to the suspect list on mmap (writable) and removed
from the suspect list on release(), flush(), or sync_fs() if the real inode is
no longer writably mapped.

There was another variant where the ovl inode is added to the suspect list on
open for write and removed from it on release(), flush(), or sync_fs()
if the real inode is not inode_is_open_for_write().

In both cases the list will contain inodes whose real inode is not dirty, but
in both cases the list shouldn't be terribly large to traverse on sync_fs().
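
A minimal userspace sketch of the suspect-list bookkeeping (mmap variant) might look like the following. All names here are hypothetical, the list is a plain singly linked list with no locking, and the release()/flush() paths are left out; only the add on mmap and the pruning on sync_fs() are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_ovl_inode {
	bool real_writably_mapped; /* stand-in for the real inode's mapping state */
	bool on_suspect_list;
	struct toy_ovl_inode *next;
};

struct toy_suspects {
	struct toy_ovl_inode *head;
	size_t count;
};

/* Writable mmap: put the overlayfs inode on the list if it is not there yet. */
static void suspect_add(struct toy_suspects *s, struct toy_ovl_inode *inode)
{
	if (inode->on_suspect_list)
		return;
	inode->next = s->head;
	s->head = inode;
	inode->on_suspect_list = true;
	s->count++;
}

/*
 * sync_fs(): visit every suspect (this is where the real inode would be
 * synced) and drop the ones whose real inode is no longer writably mapped.
 * Returns the number of inodes visited.
 */
static size_t suspect_sync(struct toy_suspects *s)
{
	size_t visited = 0;
	struct toy_ovl_inode **link = &s->head;

	while (*link) {
		struct toy_ovl_inode *inode = *link;

		visited++;
		if (!inode->real_writably_mapped) {
			*link = inode->next; /* unlink from the list */
			inode->next = NULL;
			inode->on_suspect_list = false;
			s->count--;
		} else {
			link = &inode->next;
		}
	}
	return visited;
}
```

The cost of each sync is one pass over the suspects, so the list stays cheap as long as few files are writably mapped at a time.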

Chengguang tried to implement the idea without an actual list by
re-dirtying the "suspect" inodes on every write_inode(), but I personally have
no idea if his idea works.

I think we can resort to using an actual suspect list if you say that it
cannot work like this?

Thanks,
Amir.
Jan Kara Nov. 5, 2020, 3:54 p.m. UTC | #5
On Thu 05-11-20 16:21:27, Amir Goldstein wrote:
> On Thu, Nov 5, 2020 at 4:03 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 04-11-20 19:54:03, Chengguang Xu wrote:
> > >  ---- On Tue, 2020-11-03 01:30:52 Jan Kara <jack@suse.cz> wrote ----
> > >  > On Sun 25-10-20 11:41:14, Chengguang Xu wrote:
> > >  > > Overlayfs cannot be notified when mmapped area gets dirty,
> > >  > > so we need to proactively mark inode dirty in ->mmap operation.
> > >  > >
> > >  > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
> > >  > > ---
> > >  > >  fs/overlayfs/file.c | 4 ++++
> > >  > >  1 file changed, 4 insertions(+)
> > >  > >
> > >  > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> > >  > > index efccb7c1f9bc..cd6fcdfd81a9 100644
> > >  > > --- a/fs/overlayfs/file.c
> > >  > > +++ b/fs/overlayfs/file.c
> > >  > > @@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
> > >  > >          /* Drop reference count from new vm_file value */
> > >  > >          fput(realfile);
> > >  > >      } else {
> > >  > > +        if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
> > >  > > +            vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
> > >  > > +            ovl_mark_inode_dirty(file_inode(file));
> > >  > > +
> > >  >
> > >  > But does this work reliably? I mean once writeback runs, your inode (as
> > >  > well as upper inode) is cleaned. Then a page fault comes so file has dirty
> > >  > pages again and would need flushing but overlayfs inode stays clean? Am I
> > >  > missing something?
> > >  >
> > >
> > > Yeah, this is key point of this approach, in order to  fix the issue I
> > > explicitly set I_DIRTY_SYNC flag in ovl_mark_inode_dirty(), so what i
> > > mean is during writeback we will call into ->write_inode() by this
> > > flag(I_DIRTY_SYNC) and at that place we get chance to check mapping and
> > > re-dirty overlay's inode. The code logic like below in ovl_write_inode().
> > >
> > >     if (mapping_writably_mapped(upper->i_mapping) ||
> > >          mapping_tagged(upper->i_mapping, PAGECACHE_TAG_WRITEBACK))
> > >                  iflag |= I_DIRTY_PAGES;
> >
> > OK, but suppose the upper mapping is clean at this moment (upper inode has
> > been fully written out for whatever reason, but it is still mapped) so your
> > overlayfs inode becomes clean as well. Then I don't see a mechanism which
> > would make your overlayfs inode dirty again when a write to mmap happens,
> > set_page_dirty() will end up marking upper inode with I_DIRTY_PAGES flag.
> >
> > Note that ovl_mmap() gets called only at mmap(2) syscall time but then
> > pages get faulted in, dirtied, cleaned fully at discretion of the mm
> > / writeback subsystem.
> >
> 
> Perhaps I will add some background.
> 
> What I suggested was to maintain a "suspect list" in addition to
> the dirty ovl inodes.
> 
> ovl inode is added to the suspect list on mmap (writable) and removed
> from the suspect list on release() flush() or on sync_fs() if real inode is no
> longer writably mapped.
> 
> There was another variant where ovl inode is added to suspect list on open
> for write and removed from suspect list on release() flush() or sync_fs()
> if real inode is not inode_is_open_for_write().
> 
> In both cases the list will have inodes whose real is not dirty, but
> in both cases
> the list shouldn't be terribly large to traverse on sync_fs().
> 
> Chengguang tried to implement the idea without an actual list by
> re-dirtying the "suspect" inodes on every write_inode(), but I personally have
> no idea if his idea works.
> 
> I think we can resort to using an actual suspect list if you say that it
> cannot work like this?

Yeah, the suspect list (i.e., an additional list of inodes to check on sync)
you describe should work fine. Chengguang's "keep suspect inode dirty" idea
could also work, but to decide whether the inode should be redirtied we'd
have to use something like inode_is_open_for_write() or
inode_is_writeably_mapped(). The latter would need to be implemented, but it
should be an easy vma_interval_tree_foreach() walk checking each found VMA
for vma->vm_flags & VM_WRITE.
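
As a rough illustration of the check suggested above, here is a userspace model in which the VMAs of a mapping are a plain array instead of the kernel's interval tree. The toy_ names are hypothetical and the VM_WRITE value is copied for illustration only; the real helper would walk the mapping with vma_interval_tree_foreach() under the appropriate lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative copy of the kernel flag bit; treat the value as an assumption. */
#define VM_WRITE 0x00000002UL

struct toy_vma {
	unsigned long vm_flags;
};

/* True if at least one VMA mapping the inode currently allows writes. */
static bool toy_inode_is_writeably_mapped(const struct toy_vma *vmas, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (vmas[i].vm_flags & VM_WRITE)
			return true;
	}
	return false;
}
```

With such a helper, the re-dirty decision depends on whether the file can still be written through a mapping, not on whether the upper mapping happens to be dirty at the moment writeback runs.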

								Honza
Chengguang Xu Nov. 6, 2020, 2:41 a.m. UTC | #6
---- On Thu, 2020-11-05 23:54:34 Jan Kara <jack@suse.cz> wrote ----
 > On Thu 05-11-20 16:21:27, Amir Goldstein wrote:
 > > On Thu, Nov 5, 2020 at 4:03 PM Jan Kara <jack@suse.cz> wrote:
 > > >
 > > > On Wed 04-11-20 19:54:03, Chengguang Xu wrote:
 > > > >  ---- On Tue, 2020-11-03 01:30:52 Jan Kara <jack@suse.cz> wrote ----
 > > > >  > On Sun 25-10-20 11:41:14, Chengguang Xu wrote:
 > > > >  > > Overlayfs cannot be notified when mmapped area gets dirty,
 > > > >  > > so we need to proactively mark inode dirty in ->mmap operation.
 > > > >  > >
 > > > >  > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
 > > > >  > > ---
 > > > >  > >  fs/overlayfs/file.c | 4 ++++
 > > > >  > >  1 file changed, 4 insertions(+)
 > > > >  > >
 > > > >  > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
 > > > >  > > index efccb7c1f9bc..cd6fcdfd81a9 100644
 > > > >  > > --- a/fs/overlayfs/file.c
 > > > >  > > +++ b/fs/overlayfs/file.c
 > > > >  > > @@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
 > > > >  > >          /* Drop reference count from new vm_file value */
 > > > >  > >          fput(realfile);
 > > > >  > >      } else {
 > > > >  > > +        if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
 > > > >  > > +            vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
 > > > >  > > +            ovl_mark_inode_dirty(file_inode(file));
 > > > >  > > +
 > > > >  >
 > > > >  > But does this work reliably? I mean once writeback runs, your inode (as
 > > > >  > well as upper inode) is cleaned. Then a page fault comes so file has dirty
 > > > >  > pages again and would need flushing but overlayfs inode stays clean? Am I
 > > > >  > missing something?
 > > > >  >
 > > > >
 > > > > Yeah, this is key point of this approach, in order to  fix the issue I
 > > > > explicitly set I_DIRTY_SYNC flag in ovl_mark_inode_dirty(), so what i
 > > > > mean is during writeback we will call into ->write_inode() by this
 > > > > flag(I_DIRTY_SYNC) and at that place we get chance to check mapping and
 > > > > re-dirty overlay's inode. The code logic like below in ovl_write_inode().
 > > > >
 > > > >     if (mapping_writably_mapped(upper->i_mapping) ||
 > > > >          mapping_tagged(upper->i_mapping, PAGECACHE_TAG_WRITEBACK))
 > > > >                  iflag |= I_DIRTY_PAGES;
 > > >
 > > > OK, but suppose the upper mapping is clean at this moment (upper inode has
 > > > been fully written out for whatever reason, but it is still mapped) so your
 > > > overlayfs inode becomes clean as well. Then I don't see a mechanism which
 > > > would make your overlayfs inode dirty again when a write to mmap happens,
 > > > set_page_dirty() will end up marking upper inode with I_DIRTY_PAGES flag.
 > > >
 > > > Note that ovl_mmap() gets called only at mmap(2) syscall time but then
 > > > pages get faulted in, dirtied, cleaned fully at discretion of the mm
 > > > / writeback subsystem.
 > > >
 > > 
 > > Perhaps I will add some background.
 > > 
 > > What I suggested was to maintain a "suspect list" in addition to
 > > the dirty ovl inodes.
 > > 
 > > ovl inode is added to the suspect list on mmap (writable) and removed
 > > from the suspect list on release() flush() or on sync_fs() if real inode is no
 > > longer writably mapped.
 > > 
 > > There was another variant where ovl inode is added to suspect list on open
 > > for write and removed from suspect list on release() flush() or sync_fs()
 > > if real inode is not inode_is_open_for_write().
 > > 
 > > In both cases the list will have inodes whose real is not dirty, but
 > > in both cases
 > > the list shouldn't be terribly large to traverse on sync_fs().
 > > 
 > > Chengguang tried to implement the idea without an actual list by
 > > re-dirtying the "suspect" inodes on every write_inode(), but I personally have
 > > no idea if his idea works.
 > > 
 > > I think we can resort to using an actual suspect list if you say that it
 > > cannot work like this?
 > 
 > Yeah, the suspect list (i.e., additional list of inodes to check on sync)
 > you describe should work fine. 

I think this solution still has the problem we ran into in the thread below [1].
The main problem is the combination of a clean overlayfs inode and a dirty upper inode.
 
[1] https://www.spinics.net/lists/linux-unionfs/msg07448.html

 > Also the "keep suspect inode dirty" idea
 > of Chengguang could work fine but we'd have to use something like
 > inode_is_open_for_write() or inode_is_writeably_mapped() (which would need
 > to be implemented but it should be easy vma_interval_tree_foreach() walk
 > checking each found VMA for vma->vm_flags & VM_WRITE) for checking whether
 > inode should be redirtied or not.
 > 

I'm curious: isn't it enough to check i_mmap_writable via mapping_writably_mapped()?
Am I missing something?


Thanks,
Chengguang
Jan Kara Nov. 6, 2020, 8:50 a.m. UTC | #7
On Fri 06-11-20 10:41:44, Chengguang Xu wrote:
>  ---- On Thu, 2020-11-05 23:54:34 Jan Kara <jack@suse.cz> wrote ----
>  > On Thu 05-11-20 16:21:27, Amir Goldstein wrote:
>  > > On Thu, Nov 5, 2020 at 4:03 PM Jan Kara <jack@suse.cz> wrote:
>  > > >
>  > > > On Wed 04-11-20 19:54:03, Chengguang Xu wrote:
>  > > > >  ---- On Tue, 2020-11-03 01:30:52 Jan Kara <jack@suse.cz> wrote ----
>  > > > >  > On Sun 25-10-20 11:41:14, Chengguang Xu wrote:
>  > > > >  > > Overlayfs cannot be notified when mmapped area gets dirty,
>  > > > >  > > so we need to proactively mark inode dirty in ->mmap operation.
>  > > > >  > >
>  > > > >  > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
>  > > > >  > > ---
>  > > > >  > >  fs/overlayfs/file.c | 4 ++++
>  > > > >  > >  1 file changed, 4 insertions(+)
>  > > > >  > >
>  > > > >  > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
>  > > > >  > > index efccb7c1f9bc..cd6fcdfd81a9 100644
>  > > > >  > > --- a/fs/overlayfs/file.c
>  > > > >  > > +++ b/fs/overlayfs/file.c
>  > > > >  > > @@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
>  > > > >  > >          /* Drop reference count from new vm_file value */
>  > > > >  > >          fput(realfile);
>  > > > >  > >      } else {
>  > > > >  > > +        if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
>  > > > >  > > +            vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
>  > > > >  > > +            ovl_mark_inode_dirty(file_inode(file));
>  > > > >  > > +
>  > > > >  >
>  > > > >  > But does this work reliably? I mean once writeback runs, your inode (as
>  > > > >  > well as upper inode) is cleaned. Then a page fault comes so file has dirty
>  > > > >  > pages again and would need flushing but overlayfs inode stays clean? Am I
>  > > > >  > missing something?
>  > > > >  >
>  > > > >
>  > > > > Yeah, this is key point of this approach, in order to  fix the issue I
>  > > > > explicitly set I_DIRTY_SYNC flag in ovl_mark_inode_dirty(), so what i
>  > > > > mean is during writeback we will call into ->write_inode() by this
>  > > > > flag(I_DIRTY_SYNC) and at that place we get chance to check mapping and
>  > > > > re-dirty overlay's inode. The code logic like below in ovl_write_inode().
>  > > > >
>  > > > >     if (mapping_writably_mapped(upper->i_mapping) ||
>  > > > >          mapping_tagged(upper->i_mapping, PAGECACHE_TAG_WRITEBACK))
>  > > > >                  iflag |= I_DIRTY_PAGES;
>  > > >
>  > > > OK, but suppose the upper mapping is clean at this moment (upper inode has
>  > > > been fully written out for whatever reason, but it is still mapped) so your
>  > > > overlayfs inode becomes clean as well. Then I don't see a mechanism which
>  > > > would make your overlayfs inode dirty again when a write to mmap happens,
>  > > > set_page_dirty() will end up marking upper inode with I_DIRTY_PAGES flag.
>  > > >
>  > > > Note that ovl_mmap() gets called only at mmap(2) syscall time but then
>  > > > pages get faulted in, dirtied, cleaned fully at discretion of the mm
>  > > > / writeback subsystem.
>  > > >
>  > > 
>  > > Perhaps I will add some background.
>  > > 
>  > > What I suggested was to maintain a "suspect list" in addition to
>  > > the dirty ovl inodes.
>  > > 
>  > > ovl inode is added to the suspect list on mmap (writable) and removed
>  > > from the suspect list on release() flush() or on sync_fs() if real inode is no
>  > > longer writably mapped.
>  > > 
>  > > There was another variant where ovl inode is added to suspect list on open
>  > > for write and removed from suspect list on release() flush() or sync_fs()
>  > > if real inode is not inode_is_open_for_write().
>  > > 
>  > > In both cases the list will have inodes whose real is not dirty, but
>  > > in both cases
>  > > the list shouldn't be terribly large to traverse on sync_fs().
>  > > 
>  > > Chengguang tried to implement the idea without an actual list by
>  > > re-dirtying the "suspect" inodes on every write_inode(), but I personally have
>  > > no idea if his idea works.
>  > > 
>  > > I think we can resort to using an actual suspect list if you say that it
>  > > cannot work like this?
>  > 
>  > Yeah, the suspect list (i.e., additional list of inodes to check on sync)
>  > you describe should work fine. 
> 
> I think this solution still has the problem we have met in below thread[1]
> The main problem is the state combination of clean overlayfs' inode && dirty upper inode.

But I think the scheme Amir proposed and I detailed in my previous email
should prevent that state. Because while the inode is mapped, it will be
kept in the dirty list. So which scenario do you think would lead to clean
overlayfs inode and dirty upper inode?

> [1] https://www.spinics.net/lists/linux-unionfs/msg07448.html
> 
>  > Also the "keep suspect inode dirty" idea
>  > of Chengguang could work fine but we'd have to use something like
>  > inode_is_open_for_write() or inode_is_writeably_mapped() (which would need
>  > to be implemented but it should be easy vma_interval_tree_foreach() walk
>  > checking each found VMA for vma->vm_flags & VM_WRITE) for checking whether
>  > inode should be redirtied or not.
>  > 
> 
> I'm curious that isn't  it enough to check  i_mmap_writable by
> mapping_writably_mapped() ?  Am I missing something?

What is i_mmap_writeable? I've grepped the tree and didn't find anything
like that...

								Honza
Chengguang Xu Nov. 6, 2020, 9:47 a.m. UTC | #8
---- On Fri, 2020-11-06 16:50:23 Jan Kara <jack@suse.cz> wrote ----
 > On Fri 06-11-20 10:41:44, Chengguang Xu wrote:
 > >  ---- On Thu, 2020-11-05 23:54:34 Jan Kara <jack@suse.cz> wrote ----
 > >  > On Thu 05-11-20 16:21:27, Amir Goldstein wrote:
 > >  > > On Thu, Nov 5, 2020 at 4:03 PM Jan Kara <jack@suse.cz> wrote:
 > >  > > >
 > >  > > > On Wed 04-11-20 19:54:03, Chengguang Xu wrote:
 > >  > > > >  ---- On Tue, 2020-11-03 01:30:52 Jan Kara <jack@suse.cz> wrote ----
 > >  > > > >  > On Sun 25-10-20 11:41:14, Chengguang Xu wrote:
 > >  > > > >  > > Overlayfs cannot be notified when mmapped area gets dirty,
 > >  > > > >  > > so we need to proactively mark inode dirty in ->mmap operation.
 > >  > > > >  > >
 > >  > > > >  > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
 > >  > > > >  > > ---
 > >  > > > >  > >  fs/overlayfs/file.c | 4 ++++
 > >  > > > >  > >  1 file changed, 4 insertions(+)
 > >  > > > >  > >
 > >  > > > >  > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
 > >  > > > >  > > index efccb7c1f9bc..cd6fcdfd81a9 100644
 > >  > > > >  > > --- a/fs/overlayfs/file.c
 > >  > > > >  > > +++ b/fs/overlayfs/file.c
 > >  > > > >  > > @@ -486,6 +486,10 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
 > >  > > > >  > >          /* Drop reference count from new vm_file value */
 > >  > > > >  > >          fput(realfile);
 > >  > > > >  > >      } else {
 > >  > > > >  > > +        if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
 > >  > > > >  > > +            vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
 > >  > > > >  > > +            ovl_mark_inode_dirty(file_inode(file));
 > >  > > > >  > > +
 > >  > > > >  >
 > >  > > > >  > But does this work reliably? I mean once writeback runs, your inode (as
 > >  > > > >  > well as upper inode) is cleaned. Then a page fault comes so file has dirty
 > >  > > > >  > pages again and would need flushing but overlayfs inode stays clean? Am I
 > >  > > > >  > missing something?
 > >  > > > >  >
 > >  > > > >
 > >  > > > > Yeah, this is key point of this approach, in order to  fix the issue I
 > >  > > > > explicitly set I_DIRTY_SYNC flag in ovl_mark_inode_dirty(), so what i
 > >  > > > > mean is during writeback we will call into ->write_inode() by this
 > >  > > > > flag(I_DIRTY_SYNC) and at that place we get chance to check mapping and
 > >  > > > > re-dirty overlay's inode. The code logic like below in ovl_write_inode().
 > >  > > > >
 > >  > > > >     if (mapping_writably_mapped(upper->i_mapping) ||
 > >  > > > >          mapping_tagged(upper->i_mapping, PAGECACHE_TAG_WRITEBACK))
 > >  > > > >                  iflag |= I_DIRTY_PAGES;
 > >  > > >
 > >  > > > OK, but suppose the upper mapping is clean at this moment (upper inode has
 > >  > > > been fully written out for whatever reason, but it is still mapped) so your
 > >  > > > overlayfs inode becomes clean as well. Then I don't see a mechanism which
 > >  > > > would make your overlayfs inode dirty again when a write to mmap happens,
 > >  > > > set_page_dirty() will end up marking upper inode with I_DIRTY_PAGES flag.
 > >  > > >
 > >  > > > Note that ovl_mmap() gets called only at mmap(2) syscall time but then
 > >  > > > pages get faulted in, dirtied, cleaned fully at discretion of the mm
 > >  > > > / writeback subsystem.
 > >  > > >
 > >  > > 
 > >  > > Perhaps I will add some background.
 > >  > > 
 > >  > > What I suggested was to maintain a "suspect list" in addition to
 > >  > > the dirty ovl inodes.
 > >  > > 
 > >  > > ovl inode is added to the suspect list on mmap (writable) and removed
 > >  > > from the suspect list on release() flush() or on sync_fs() if real inode is no
 > >  > > longer writably mapped.
 > >  > > 
 > >  > > There was another variant where ovl inode is added to suspect list on open
 > >  > > for write and removed from suspect list on release() flush() or sync_fs()
 > >  > > if real inode is not inode_is_open_for_write().
 > >  > > 
 > >  > > In both cases the list will have inodes whose real is not dirty, but
 > >  > > in both cases
 > >  > > the list shouldn't be terribly large to traverse on sync_fs().
 > >  > > 
 > >  > > Chengguang tried to implement the idea without an actual list by
 > >  > > re-dirtying the "suspect" inodes on every write_inode(), but I personally have
 > >  > > no idea if his idea works.
 > >  > > 
 > >  > > I think we can resort to using an actual suspect list if you say that it
 > >  > > cannot work like this?
 > >  > 
 > >  > Yeah, the suspect list (i.e., additional list of inodes to check on sync)
 > >  > you describe should work fine. 
 > > 
 > > I think this solution still has the problem we have met in below thread[1]
 > > The main problem is the state combination of clean overlayfs' inode && dirty upper inode.
 > 
 > But I think the scheme Amir proposed and I detailed in my previous email
 > should prevent that state. Because while the inode is mapped, it will be
 > kept in the dirty list. So which scenario do you think would lead to clean
 > overlayfs inode and dirty upper inode?

If keeping the inode in the dirty list means keeping the overlayfs inode dirty,
then I don't think we need an extra list for that: the VFS already has its own
writeback list, and the solution would be exactly the same as mine (re-dirtying). Right?


 > 
 > > [1] https://www.spinics.net/lists/linux-unionfs/msg07448.html
 > > 
 > >  > Also the "keep suspect inode dirty" idea
 > >  > of Chengguang could work fine but we'd have to use something like
 > >  > inode_is_open_for_write() or inode_is_writeably_mapped() (which would need
 > >  > to be implemented but it should be easy vma_interval_tree_foreach() walk
 > >  > checking each found VMA for vma->vm_flags & VM_WRITE) for checking whether
 > >  > inode should be redirtied or not.
 > >  > 
 > > 
 > > I'm curious that isn't  it enough to check  i_mmap_writable by
 > > mapping_writably_mapped() ?  Am I missing something?
 > 
 > What is i_mmap_writeable? I've grepped the tree and didn't find anything
 > like that...
 > 

Maybe a spelling mistake? The reason I check it is that I'm worried about VMA permissions being changed by mprotect(2).
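
The mprotect(2) concern can be reproduced from userspace. The sketch below assumes a writable /tmp and uses a hypothetical helper name; it maps a file MAP_SHARED with PROT_READ only, upgrades the mapping to writable afterwards, and then dirties a page. Reading this as the motivation for also testing VM_MAYWRITE at mmap time is an interpretation, not something the thread states outright.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if a shared mapping created read-only could be made writable later. */
static int shared_mapping_upgrade_demo(void)
{
	char path[] = "/tmp/ovl-mprotect-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	int fd = mkstemp(path); /* the file is opened read-write */
	int ret = -1;
	char *p;

	if (fd < 0 || page <= 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, page) != 0) {
		close(fd);
		return -1;
	}

	/* At mmap(2) time the mapping is shared but not writable. */
	p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}

	/* Later the permissions change without another mmap(2) call. */
	if (mprotect(p, (size_t)page, PROT_READ | PROT_WRITE) == 0) {
		p[0] = 'x'; /* dirties a page through the mapping */
		ret = msync(p, (size_t)page, MS_SYNC);
	}

	munmap(p, (size_t)page);
	close(fd);
	return ret;
}
```

Because the file descriptor was opened read-write, the upgrade succeeds, so a check made only against the protection requested at mmap(2) time would not see this mapping as writable.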


Thanks,
Chengguang
Patch

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index efccb7c1f9bc..cd6fcdfd81a9 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -486,6 +486,10 @@  static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
 		/* Drop reference count from new vm_file value */
 		fput(realfile);
 	} else {
+		if (vma->vm_flags & (VM_SHARED|VM_MAYSHARE) &&
+		    vma->vm_flags & (VM_WRITE|VM_MAYWRITE))
+			ovl_mark_inode_dirty(file_inode(file));
+
 		/* Drop reference count from previous vm_file value */
 		fput(file);
 	}
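
The condition added by the patch can be read as a standalone predicate. The sketch below is a userspace model: the flag values are copied for illustration and the helper name is hypothetical. The extra parentheses do not change the result, since in C `&` binds tighter than `&&`.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative copies of the kernel's VM_* bits; treat the values as assumptions. */
#define VM_WRITE    0x00000002UL
#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL
#define VM_MAYSHARE 0x00000080UL

/* True when the mapping is shared and is, or may later become, writable. */
static bool vma_may_dirty_shared(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_MAYSHARE)) &&
	       (vm_flags & (VM_WRITE | VM_MAYWRITE));
}
```

A private writable mapping fails the first half of the test, and a shared mapping that can never become writable fails the second half.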