[RFC,1/5] fs: introduce notifier list for vfs inode

Message ID	20201010142355.741645-2-cgxu519@mykernel.net (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=Z9On=DR=vger.kernel.org=linux-fsdevel-owner@kernel.org> From: Chengguang Xu <cgxu519@mykernel.net> To: miklos@szeredi.hu, amir73il@gmail.com, jack@suse.cz Cc: linux-unionfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, Chengguang Xu <cgxu519@mykernel.net> Message-ID: <20201010142355.741645-2-cgxu519@mykernel.net> Subject: [RFC PATCH 1/5] fs: introduce notifier list for vfs inode Date: Sat, 10 Oct 2020 22:23:51 +0800 In-Reply-To: <20201010142355.741645-1-cgxu519@mykernel.net> References: <20201010142355.741645-1-cgxu519@mykernel.net> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf8 Precedence: bulk
Series	implement containerized syncfs for overlayfs \| expand [RFC,0/5] implement containerized syncfs for overlayfs [RFC,1/5] fs: introduce notifier list for vfs inode [RFC,2/5] fs: export symbol of writeback_single_inode() [RFC,3/5] ovl: setup overlayfs' private bdi [RFC,4/5] ovl: monitor marking dirty activity of underlying upper inode [RFC,5/5] ovl: impement containerized syncfs for overlayfs

Chengguang Xu Oct. 10, 2020, 2:23 p.m. UTC

Currently there is no notification api for kernel about modification
of vfs inode, in some use cases like overlayfs, this kind of notification
will be very helpful to implement containerized syncfs functionality.
As the first attempt, we introduce marking inode dirty notification so that
overlay's inode could mark itself dirty as well and then only sync dirty
overlay inode while syncfs.

Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
---
 fs/fs-writeback.c  | 4 ++++
 fs/inode.c         | 5 +++++
 include/linux/fs.h | 6 ++++++
 3 files changed, 15 insertions(+)

Jan Kara Oct. 14, 2020, 4:15 p.m. UTC | #1

On Sat 10-10-20 22:23:51, Chengguang Xu wrote:
> Currently there is no notification api for kernel about modification
> of vfs inode, in some use cases like overlayfs, this kind of notification
> will be very helpful to implement containerized syncfs functionality.
> As the first attempt, we introduce marking inode dirty notification so that
> overlay's inode could mark itself dirty as well and then only sync dirty
> overlay inode while syncfs.
> 
> Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>

So I like how the patch set is elegant however growing struct inode for
everybody by struct blocking_notifier_head (which is rwsem + pointer) is
rather harsh just for this overlayfs functionality... Ideally this should
induce no overhead on struct inode if the filesystem is not covered by
overlayfs. So I'd rather place some external structure into the superblock
that would get allocated on the first use that would track dirty notifications
for each inode. I would not generalize the code for more possible
notifications at this point.

Also now that I'm thinking about it can there be multiple overlayfs inodes
for one upper inode? If not, then the mechanism of dirtiness propagation
could be much simpler - it seems we could be able to just lookup
corresponding overlayfs inode based on upper inode and then mark it dirty
(but this would need to be verified by people more familiar with
overlayfs). So all we'd need to know for this is the superblock of the
overlayfs that's covering given upper filesystem...

								Honza
> ---
>  fs/fs-writeback.c  | 4 ++++
>  fs/inode.c         | 5 +++++
>  include/linux/fs.h | 6 ++++++
>  3 files changed, 15 insertions(+)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 1492271..657cceb 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -2246,9 +2246,13 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>  {
>  	struct super_block *sb = inode->i_sb;
>  	int dirtytime;
> +	int copy_flags = flags;
>  
>  	trace_writeback_mark_inode_dirty(inode, flags);
>  
> +	blocking_notifier_call_chain(
> +		&inode->notifier_lists[MARK_INODE_DIRTY_NOTIFIER],
> +		0, &copy_flags);
>  	/*
>  	 * Don't do this for I_DIRTY_PAGES - that doesn't actually
>  	 * dirty the inode itself
> diff --git a/fs/inode.c b/fs/inode.c
> index 72c4c34..84e82db 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -388,12 +388,17 @@ void address_space_init_once(struct address_space *mapping)
>   */
>  void inode_init_once(struct inode *inode)
>  {
> +	int i;
> +
>  	memset(inode, 0, sizeof(*inode));
>  	INIT_HLIST_NODE(&inode->i_hash);
>  	INIT_LIST_HEAD(&inode->i_devices);
>  	INIT_LIST_HEAD(&inode->i_io_list);
>  	INIT_LIST_HEAD(&inode->i_wb_list);
>  	INIT_LIST_HEAD(&inode->i_lru);
> +	for (i = 0; i < INODE_MAX_NOTIFIER; i++)
> +		BLOCKING_INIT_NOTIFIER_HEAD(
> +			&inode->notifier_lists[MARK_INODE_DIRTY_NOTIFIER]);
>  	__address_space_init_once(&inode->i_data);
>  	i_size_ordered_init(inode);
>  }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7519ae0..42f0750 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -607,6 +607,11 @@ static inline void mapping_allow_writable(struct address_space *mapping)
>  
>  struct fsnotify_mark_connector;
>  
> +enum {
> +	MARK_INODE_DIRTY_NOTIFIER,
> +	INODE_MAX_NOTIFIER,
> +};
> +
>  /*
>   * Keep mostly read-only and often accessed (especially for
>   * the RCU path lookup and 'stat' data) fields at the beginning
> @@ -723,6 +728,7 @@ struct inode {
>  #endif
>  
>  	void			*i_private; /* fs or device private pointer */
> +	struct	blocking_notifier_head notifier_lists[INODE_MAX_NOTIFIER];
>  } __randomize_layout;
>  
>  struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
> -- 
> 1.8.3.1
> 
>

Chengguang Xu Oct. 15, 2020, 3:03 a.m. UTC | #2

---- 在 星期四, 2020-10-15 00:15:38 Jan Kara <jack@suse.cz> 撰写 ----
 > On Sat 10-10-20 22:23:51, Chengguang Xu wrote:
 > > Currently there is no notification api for kernel about modification
 > > of vfs inode, in some use cases like overlayfs, this kind of notification
 > > will be very helpful to implement containerized syncfs functionality.
 > > As the first attempt, we introduce marking inode dirty notification so that
 > > overlay's inode could mark itself dirty as well and then only sync dirty
 > > overlay inode while syncfs.
 > > 
 > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
 > 
 > So I like how the patch set is elegant however growing struct inode for
 > everybody by struct blocking_notifier_head (which is rwsem + pointer) is
 > rather harsh just for this overlayfs functionality... Ideally this should
 > induce no overhead on struct inode if the filesystem is not covered by
 > overlayfs. So I'd rather place some external structure into the superblock
 > that would get allocated on the first use that would track dirty notifications
 > for each inode. I would not generalize the code for more possible
 > notifications at this point.
 > 
 > Also now that I'm thinking about it can there be multiple overlayfs inodes
 > for one upper inode? If not, then the mechanism of dirtiness propagation

One upper inode only belongs to one overlayfs inode. However, in practice
one upper fs may contains hundreds or even thousands of overlayfs instances.

 > could be much simpler - it seems we could be able to just lookup
 > corresponding overlayfs inode based on upper inode and then mark it dirty
 > (but this would need to be verified by people more familiar with
 > overlayfs). So all we'd need to know for this is the superblock of the
 > overlayfs that's covering given upper filesystem...
 > 

So the difficulty is how we get overlayfs inode efficiently from upper inode,
it seems if we don't have additional info of upper inode to indicate which
overlayfs it belongs to,  then the lookup of corresponding overlayfs inode
will be quite expensive and probably impact write performance. 

Is that possible we extend inotify infrastructure slightly to notify both
user space and kernel component?


Thanks,
Chengguang

Al Viro Oct. 15, 2020, 3:25 a.m. UTC | #3

On Sat, Oct 10, 2020 at 10:23:51PM +0800, Chengguang Xu wrote:
> Currently there is no notification api for kernel about modification
> of vfs inode, in some use cases like overlayfs, this kind of notification
> will be very helpful to implement containerized syncfs functionality.
> As the first attempt, we introduce marking inode dirty notification so that
> overlay's inode could mark itself dirty as well and then only sync dirty
> overlay inode while syncfs.

Who's responsible for removing the crap from notifier chain?  And how does
that affect the lifetime of inode?

Chengguang Xu Oct. 15, 2020, 3:42 a.m. UTC | #4

---- 在 星期四, 2020-10-15 11:25:01 Al Viro <viro@zeniv.linux.org.uk> 撰写 ----
 > On Sat, Oct 10, 2020 at 10:23:51PM +0800, Chengguang Xu wrote:
 > > Currently there is no notification api for kernel about modification
 > > of vfs inode, in some use cases like overlayfs, this kind of notification
 > > will be very helpful to implement containerized syncfs functionality.
 > > As the first attempt, we introduce marking inode dirty notification so that
 > > overlay's inode could mark itself dirty as well and then only sync dirty
 > > overlay inode while syncfs.
 > 
 > Who's responsible for removing the crap from notifier chain?  And how does
 > that affect the lifetime of inode?
 
In this case, overlayfs unregisters call back from the notifier chain of upper inode
when evicting it's own  inode. It will not affect the lifetime of upper inode because
overlayfs inode holds a reference of upper inode that means upper inode will not be
evicted while overlayfs inode is still alive.


Thanks,
Chengguang

Al Viro Oct. 15, 2020, 4:57 a.m. UTC | #5

On Thu, Oct 15, 2020 at 11:42:51AM +0800, Chengguang Xu wrote:
>  ---- 在 星期四, 2020-10-15 11:25:01 Al Viro <viro@zeniv.linux.org.uk> 撰写 ----
>  > On Sat, Oct 10, 2020 at 10:23:51PM +0800, Chengguang Xu wrote:
>  > > Currently there is no notification api for kernel about modification
>  > > of vfs inode, in some use cases like overlayfs, this kind of notification
>  > > will be very helpful to implement containerized syncfs functionality.
>  > > As the first attempt, we introduce marking inode dirty notification so that
>  > > overlay's inode could mark itself dirty as well and then only sync dirty
>  > > overlay inode while syncfs.
>  > 
>  > Who's responsible for removing the crap from notifier chain?  And how does
>  > that affect the lifetime of inode?
>  
> In this case, overlayfs unregisters call back from the notifier chain of upper inode
> when evicting it's own  inode. It will not affect the lifetime of upper inode because
> overlayfs inode holds a reference of upper inode that means upper inode will not be
> evicted while overlayfs inode is still alive.

Let me see if I've got it right:
	* your chain contains 1 (for upper inodes) or 0 (everything else, i.e. the
vast majority of inodes) recepients
	* recepient pins the inode for as long as the recepient exists

That looks like a massive overkill, especially since all you are propagating is
dirtying the suckers.  All you really need is one bit in your inode + hash table
indexed by the address of struct inode (well, middle bits thereof, as usual).
With entries embedded into overlayfs-private part of overlayfs inode.  And callback
to be called stored in that entry...

Amir Goldstein Oct. 15, 2020, 6:11 a.m. UTC | #6

On Thu, Oct 15, 2020 at 6:03 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
>
>  ---- 在 星期四, 2020-10-15 00:15:38 Jan Kara <jack@suse.cz> 撰写 ----
>  > On Sat 10-10-20 22:23:51, Chengguang Xu wrote:
>  > > Currently there is no notification api for kernel about modification
>  > > of vfs inode, in some use cases like overlayfs, this kind of notification
>  > > will be very helpful to implement containerized syncfs functionality.
>  > > As the first attempt, we introduce marking inode dirty notification so that
>  > > overlay's inode could mark itself dirty as well and then only sync dirty
>  > > overlay inode while syncfs.
>  > >
>  > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
>  >
>  > So I like how the patch set is elegant however growing struct inode for
>  > everybody by struct blocking_notifier_head (which is rwsem + pointer) is
>  > rather harsh just for this overlayfs functionality... Ideally this should
>  > induce no overhead on struct inode if the filesystem is not covered by
>  > overlayfs. So I'd rather place some external structure into the superblock
>  > that would get allocated on the first use that would track dirty notifications
>  > for each inode. I would not generalize the code for more possible
>  > notifications at this point.
>  >
>  > Also now that I'm thinking about it can there be multiple overlayfs inodes
>  > for one upper inode? If not, then the mechanism of dirtiness propagation
>
> One upper inode only belongs to one overlayfs inode. However, in practice
> one upper fs may contains hundreds or even thousands of overlayfs instances.
>
>  > could be much simpler - it seems we could be able to just lookup
>  > corresponding overlayfs inode based on upper inode and then mark it dirty
>  > (but this would need to be verified by people more familiar with
>  > overlayfs). So all we'd need to know for this is the superblock of the
>  > overlayfs that's covering given upper filesystem...
>  >
>
> So the difficulty is how we get overlayfs inode efficiently from upper inode,
> it seems if we don't have additional info of upper inode to indicate which
> overlayfs it belongs to,  then the lookup of corresponding overlayfs inode
> will be quite expensive and probably impact write performance.
>
> Is that possible we extend inotify infrastructure slightly to notify both
> user space and kernel component?
>
>

When I first saw your suggestion, that is what I was thinking.
Add event fsnotify_dirty_inode(), since the pub-sub infrastructure
in struct inode already exists.

But I have to admit this approach seems like a massive overkill to
what you need.

What you implemented, tracks upper inodes that could have
been dirtied under overlayfs, but what you really want is to
track is upper inodes that were dirtied *by* overlayfs.

And for that purpose, as Miklos said several times, it would be best
pursue the overlayfs aops approach, even though it may be much
harder..

Your earlier patches maintained a list of overlayfs to be synced inodes.
Remind me what was wrong with that approach?

Perhaps you can combine that with the shadow overlay sbi approach.
Instead of dirtying overlay inode when underlying is dirtied, you can
"pre-dirty" overlayfs inode in higher level file ops and add them to the
"maybe-dirty" list (e.g. after write).

ovl_sync_fs() can iterate the maybe-dirty list and re-dirty overlay inodes
if the underlying inode is still dirty on the (!wait) pass.

As for memory mapped inodes via overlayfs (which can be dirtied without
notifying overlayfs) I am not sure that is a big problem in practice.

When an inode is writably mapped via ovarlayfs, you can flag the
overlay inode with "maybe-writably-mapped" and then remove
it from the maybe dirty list when the underlying inode is not dirty
AND its i_writecount is 0 (checked on write_inode() and release()).

Actually, there is no reason to treat writably mapped inodes and
other dirty inodes differently - insert to suspect list on open for
write, remove from suspect list on last release() or write_inode()
when inode is no longer dirty and writable.

Did I miss anything?

Amir.

Chengguang Xu Oct. 15, 2020, 10:56 a.m. UTC | #7

---- 在 星期四, 2020-10-15 12:57:41 Al Viro <viro@zeniv.linux.org.uk> 撰写 ----
 > On Thu, Oct 15, 2020 at 11:42:51AM +0800, Chengguang Xu wrote:
 > >  ---- 在 星期四, 2020-10-15 11:25:01 Al Viro <viro@zeniv.linux.org.uk> 撰写 ----
 > >  > On Sat, Oct 10, 2020 at 10:23:51PM +0800, Chengguang Xu wrote:
 > >  > > Currently there is no notification api for kernel about modification
 > >  > > of vfs inode, in some use cases like overlayfs, this kind of notification
 > >  > > will be very helpful to implement containerized syncfs functionality.
 > >  > > As the first attempt, we introduce marking inode dirty notification so that
 > >  > > overlay's inode could mark itself dirty as well and then only sync dirty
 > >  > > overlay inode while syncfs.
 > >  > 
 > >  > Who's responsible for removing the crap from notifier chain?  And how does
 > >  > that affect the lifetime of inode?
 > >  
 > > In this case, overlayfs unregisters call back from the notifier chain of upper inode
 > > when evicting it's own  inode. It will not affect the lifetime of upper inode because
 > > overlayfs inode holds a reference of upper inode that means upper inode will not be
 > > evicted while overlayfs inode is still alive.
 > 
 > Let me see if I've got it right:
 >     * your chain contains 1 (for upper inodes) or 0 (everything else, i.e. the
 > vast majority of inodes) recepients
 >     * recepient pins the inode for as long as the recepient exists
 > 
 > That looks like a massive overkill, especially since all you are propagating is
 > dirtying the suckers.  All you really need is one bit in your inode + hash table
 > indexed by the address of struct inode (well, middle bits thereof, as usual).
 > With entries embedded into overlayfs-private part of overlayfs inode.  And callback
 > to be called stored in that entry...
 > 


Yeah, actually  that could implement the logic but I'm afraid of performance degradation
caused by lock contention of hash table in concurrent file write because in practice we
build up hundreds of overlayfs instances on same underlying filesystem.

Thanks,
Chengguang

Chengguang Xu Oct. 15, 2020, 11:29 a.m. UTC | #8

---- 在 星期四, 2020-10-15 14:11:04 Amir Goldstein <amir73il@gmail.com> 撰写 ----
 > On Thu, Oct 15, 2020 at 6:03 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
 > >
 > >  ---- 在 星期四, 2020-10-15 00:15:38 Jan Kara <jack@suse.cz> 撰写 ----
 > >  > On Sat 10-10-20 22:23:51, Chengguang Xu wrote:
 > >  > > Currently there is no notification api for kernel about modification
 > >  > > of vfs inode, in some use cases like overlayfs, this kind of notification
 > >  > > will be very helpful to implement containerized syncfs functionality.
 > >  > > As the first attempt, we introduce marking inode dirty notification so that
 > >  > > overlay's inode could mark itself dirty as well and then only sync dirty
 > >  > > overlay inode while syncfs.
 > >  > >
 > >  > > Signed-off-by: Chengguang Xu <cgxu519@mykernel.net>
 > >  >
 > >  > So I like how the patch set is elegant however growing struct inode for
 > >  > everybody by struct blocking_notifier_head (which is rwsem + pointer) is
 > >  > rather harsh just for this overlayfs functionality... Ideally this should
 > >  > induce no overhead on struct inode if the filesystem is not covered by
 > >  > overlayfs. So I'd rather place some external structure into the superblock
 > >  > that would get allocated on the first use that would track dirty notifications
 > >  > for each inode. I would not generalize the code for more possible
 > >  > notifications at this point.
 > >  >
 > >  > Also now that I'm thinking about it can there be multiple overlayfs inodes
 > >  > for one upper inode? If not, then the mechanism of dirtiness propagation
 > >
 > > One upper inode only belongs to one overlayfs inode. However, in practice
 > > one upper fs may contains hundreds or even thousands of overlayfs instances.
 > >
 > >  > could be much simpler - it seems we could be able to just lookup
 > >  > corresponding overlayfs inode based on upper inode and then mark it dirty
 > >  > (but this would need to be verified by people more familiar with
 > >  > overlayfs). So all we'd need to know for this is the superblock of the
 > >  > overlayfs that's covering given upper filesystem...
 > >  >
 > >
 > > So the difficulty is how we get overlayfs inode efficiently from upper inode,
 > > it seems if we don't have additional info of upper inode to indicate which
 > > overlayfs it belongs to,  then the lookup of corresponding overlayfs inode
 > > will be quite expensive and probably impact write performance.
 > >
 > > Is that possible we extend inotify infrastructure slightly to notify both
 > > user space and kernel component?
 > >
 > >
 > 
 > When I first saw your suggestion, that is what I was thinking.
 > Add event fsnotify_dirty_inode(), since the pub-sub infrastructure
 > in struct inode already exists.
 > 
 > But I have to admit this approach seems like a massive overkill to
 > what you need.
 > 
 > What you implemented, tracks upper inodes that could have
 > been dirtied under overlayfs, but what you really want is to
 > track is upper inodes that were dirtied *by* overlayfs.
 > 
 > And for that purpose, as Miklos said several times, it would be best
 > pursue the overlayfs aops approach, even though it may be much
 > harder..
 > 

IIUC, that solution was raised to solve mmap rw-ro issue.
That maybe is an ultimate goal and I'm wondering whether we must
implement that if we have easier approach to solve mmap and syncfs
issues.

 > Your earlier patches maintained a list of overlayfs to be synced inodes.
 > Remind me what was wrong with that approach?
 > 

I think the main concerns are the complexity and the timing of releasing
ovl_sync_entry  struct.


 > Perhaps you can combine that with the shadow overlay sbi approach.
 > Instead of dirtying overlay inode when underlying is dirtied, you can
 > "pre-dirty" overlayfs inode in higher level file ops and add them to the
 > "maybe-dirty" list (e.g. after write).
 > 

Main problem is we can't be notified by set_page_dirty from writable mmap.
Meanwhile, if we dirty overlay inode then writeback will pick up dirty overlay
inode and clear it after syncing, then overlay inode could be release at any time,
so in the end, maybe overlay inode is released but upper inode is still dirty and
there is no any pointer to find upper dirty inode out.


 > ovl_sync_fs() can iterate the maybe-dirty list and re-dirty overlay inodes
 > if the underlying inode is still dirty on the (!wait) pass.
 > 
 > As for memory mapped inodes via overlayfs (which can be dirtied without
 > notifying overlayfs) I am not sure that is a big problem in practice.
 > 

Yes, it's key problem here.

 > When an inode is writably mapped via ovarlayfs, you can flag the
 > overlay inode with "maybe-writably-mapped" and then remove
 > it from the maybe dirty list when the underlying inode is not dirty
 > AND its i_writecount is 0 (checked on write_inode() and release()).
 > 
 > Actually, there is no reason to treat writably mapped inodes and
 > other dirty inodes differently - insert to suspect list on open for
 > write, remove from suspect list on last release() or write_inode()
 > when inode is no longer dirty and writable.
 > 
 > Did I miss anything?
 > 

If we dirty overlay inode that means we have launched writeback mechanism,
so in this case, re-dirty overlay inode in time becomes important. 



Thanks,
Chengguang

Amir Goldstein Oct. 15, 2020, 12:32 p.m. UTC | #9

>  > Perhaps you can combine that with the shadow overlay sbi approach.
>  > Instead of dirtying overlay inode when underlying is dirtied, you can
>  > "pre-dirty" overlayfs inode in higher level file ops and add them to the
>  > "maybe-dirty" list (e.g. after write).
>  >
>
> Main problem is we can't be notified by set_page_dirty from writable mmap.
> Meanwhile, if we dirty overlay inode then writeback will pick up dirty overlay
> inode and clear it after syncing, then overlay inode could be release at any time,
> so in the end, maybe overlay inode is released but upper inode is still dirty and
> there is no any pointer to find upper dirty inode out.
>

But we can control whether overlay inode is release with ovl_drop_inode()
right? Can we prevent dropping overlay inode if upper inode is
inode_is_open_for_write()?

About re-dirty, see below...

>
>  > ovl_sync_fs() can iterate the maybe-dirty list and re-dirty overlay inodes
>  > if the underlying inode is still dirty on the (!wait) pass.
>  >
>  > As for memory mapped inodes via overlayfs (which can be dirtied without
>  > notifying overlayfs) I am not sure that is a big problem in practice.
>  >
>
> Yes, it's key problem here.
>
>  > When an inode is writably mapped via ovarlayfs, you can flag the
>  > overlay inode with "maybe-writably-mapped" and then remove
>  > it from the maybe dirty list when the underlying inode is not dirty
>  > AND its i_writecount is 0 (checked on write_inode() and release()).
>  >
>  > Actually, there is no reason to treat writably mapped inodes and
>  > other dirty inodes differently - insert to suspect list on open for
>  > write, remove from suspect list on last release() or write_inode()
>  > when inode is no longer dirty and writable.
>  >
>  > Did I miss anything?
>  >
>
> If we dirty overlay inode that means we have launched writeback mechanism,
> so in this case, re-dirty overlay inode in time becomes important.
>

My idea was to use the first call to ovl_sync_fs() with 'wait' false
to iterate the
maybe-dirty list and re-dirty overlay inodes whose upper is dirty.

Then in the second call to __sync_filesystem, sync_inodes_sb() will take
care of calling ovl_write_inode() for all the re-dirty inodes.

In current code we sync ALL dirty upper fs inodes and we do not overlay
inodes with no reference in cache.

The best code would sync only upper fs inodes dirtied by this overlay
and will be able to evict overlay inodes whose upper inodes are clean.

The compromise code would sync only upper fs inodes dirtied by this overlay,
and would not evict overlay inodes as long as upper inodes are "open for write".
I think its a fine compromise considering the alternatives.

Is this workable?

Thanks,
Amir.

Chengguang Xu Oct. 15, 2020, 1:13 p.m. UTC | #10

---- 在 星期四, 2020-10-15 20:32:24 Amir Goldstein <amir73il@gmail.com> 撰写 ----
 > >  > Perhaps you can combine that with the shadow overlay sbi approach.
 > >  > Instead of dirtying overlay inode when underlying is dirtied, you can
 > >  > "pre-dirty" overlayfs inode in higher level file ops and add them to the
 > >  > "maybe-dirty" list (e.g. after write).
 > >  >
 > >
 > > Main problem is we can't be notified by set_page_dirty from writable mmap.
 > > Meanwhile, if we dirty overlay inode then writeback will pick up dirty overlay
 > > inode and clear it after syncing, then overlay inode could be release at any time,
 > > so in the end, maybe overlay inode is released but upper inode is still dirty and
 > > there is no any pointer to find upper dirty inode out.
 > >
 > 
 > But we can control whether overlay inode is release with ovl_drop_inode()
 > right? Can we prevent dropping overlay inode if upper inode is
 > inode_is_open_for_write()?
 > 
 > About re-dirty, see below...
 > 
 > >
 > >  > ovl_sync_fs() can iterate the maybe-dirty list and re-dirty overlay inodes
 > >  > if the underlying inode is still dirty on the (!wait) pass.
 > >  >
 > >  > As for memory mapped inodes via overlayfs (which can be dirtied without
 > >  > notifying overlayfs) I am not sure that is a big problem in practice.
 > >  >
 > >
 > > Yes, it's key problem here.
 > >
 > >  > When an inode is writably mapped via ovarlayfs, you can flag the
 > >  > overlay inode with "maybe-writably-mapped" and then remove
 > >  > it from the maybe dirty list when the underlying inode is not dirty
 > >  > AND its i_writecount is 0 (checked on write_inode() and release()).
 > >  >
 > >  > Actually, there is no reason to treat writably mapped inodes and
 > >  > other dirty inodes differently - insert to suspect list on open for
 > >  > write, remove from suspect list on last release() or write_inode()
 > >  > when inode is no longer dirty and writable.

I have to say inserting to suspect list cannot prevent dropping,
thinking of the problem of previous approach that we write dirty upper
inode with current->flags & PF_MEMALLOC while evicting clean overlay inode.


 > >  >
 > >  > Did I miss anything?
 > >  >
 > >
 > > If we dirty overlay inode that means we have launched writeback mechanism,
 > > so in this case, re-dirty overlay inode in time becomes important.
 > >
 > 
 > My idea was to use the first call to ovl_sync_fs() with 'wait' false
 > to iterate the
 > maybe-dirty list and re-dirty overlay inodes whose upper is dirty.
 > 

I'm curious how we prevent dropping of clean overlay inode with dirty upper?
Hold another reference during iput_final operation? in the drop_inode() or something
else?


 > Then in the second call to __sync_filesystem, sync_inodes_sb() will take
 > care of calling ovl_write_inode() for all the re-dirty inodes.
 > 
 > In current code we sync ALL dirty upper fs inodes and we do not overlay
 > inodes with no reference in cache.
 > 
 > The best code would sync only upper fs inodes dirtied by this overlay
 > and will be able to evict overlay inodes whose upper inodes are clean.
 > 
 > The compromise code would sync only upper fs inodes dirtied by this overlay,
 > and would not evict overlay inodes as long as upper inodes are "open for write".
 > I think its a fine compromise considering the alternatives.
 > 
 > Is this workable?
 > 

In your approach, the key point is how to prevent dropping overlay inode that has
dirty upper and no reference but I don't understand well how to achieve it from
your descriptions.


Thanks,
Chengguang

Amir Goldstein Oct. 15, 2020, 4:02 p.m. UTC | #11

>  > >  > When an inode is writably mapped via ovarlayfs, you can flag the
>  > >  > overlay inode with "maybe-writably-mapped" and then remove
>  > >  > it from the maybe dirty list when the underlying inode is not dirty
>  > >  > AND its i_writecount is 0 (checked on write_inode() and release()).
>  > >  >
>  > >  > Actually, there is no reason to treat writably mapped inodes and
>  > >  > other dirty inodes differently - insert to suspect list on open for
>  > >  > write, remove from suspect list on last release() or write_inode()
>  > >  > when inode is no longer dirty and writable.
>
> I have to say inserting to suspect list cannot prevent dropping,
> thinking of the problem of previous approach that we write dirty upper
> inode with current->flags & PF_MEMALLOC while evicting clean overlay inode.
>

Sorry, I don't understand what that means.

>
>  > >  >
>  > >  > Did I miss anything?
>  > >  >
>  > >
>  > > If we dirty overlay inode that means we have launched writeback mechanism,
>  > > so in this case, re-dirty overlay inode in time becomes important.
>  > >
>  >
>  > My idea was to use the first call to ovl_sync_fs() with 'wait' false
>  > to iterate the
>  > maybe-dirty list and re-dirty overlay inodes whose upper is dirty.
>  >
>
> I'm curious how we prevent dropping of clean overlay inode with dirty upper?
> Hold another reference during iput_final operation? in the drop_inode() or something
> else?

No, just return 0 from ovl_drop_inode() and iput_final() will not evict().

>
>
>  > Then in the second call to __sync_filesystem, sync_inodes_sb() will take
>  > care of calling ovl_write_inode() for all the re-dirty inodes.
>  >
>  > In current code we sync ALL dirty upper fs inodes and we do not overlay
>  > inodes with no reference in cache.
>  >
>  > The best code would sync only upper fs inodes dirtied by this overlay
>  > and will be able to evict overlay inodes whose upper inodes are clean.
>  >
>  > The compromise code would sync only upper fs inodes dirtied by this overlay,
>  > and would not evict overlay inodes as long as upper inodes are "open for write".
>  > I think its a fine compromise considering the alternatives.
>  >
>  > Is this workable?
>  >
>
> In your approach, the key point is how to prevent dropping overlay inode that has
> dirty upper and no reference but I don't understand well how to achieve it from
> your descriptions.
>
>

Very well, I will try to explain with code:

int ovl_inode_is_open_for_write(struct inode *inode)
{
       struct inode *upper_inode = ovl_inode_upper(inode);

       return upper_inode && inode_is_open_for_write(upper_inode);
}

void ovl_maybe_mark_inode_dirty(struct inode *inode)
{
       struct inode *upper_inode = ovl_inode_upper(inode);

       if (upper_inode && upper_inode->i_state & I_DIRTY_ALL)
                mark_inode_dirty(inode);
}

int ovl_open(struct inode *inode, struct file *file)
{
...
       if (ovl_inode_is_open_for_write(file_inode(file)))
               ovl_add_inode_to_suspect_list(inode);

        file->private_data = realfile;

        return 0;
}

int ovl_release(struct inode *inode, struct file *file)
{
       struct inode *inode = file_inode(file);

       if (ovl_inode_is_open_for_write(inode)) {
               ovl_maybe_mark_inode_dirty(inode);
               ovl_remove_inode_from_suspect_list(inode);
       }

        fput(file->private_data);

        return 0;
}

int ovl_drop_inode(struct inode *inode)
{
       struct inode *upper_inode = ovl_inode_upper(inode);

       if (!upper_inode)
               return 1;
       if (upper_inode->i_state & I_DIRTY_ALL)
               return 0;

       return !inode_is_open_for_write(upper_inode);
}

static int ovl_sync_fs(struct super_block *sb, int wait)
{
        struct ovl_fs *ofs = sb->s_fs_info;
        struct super_block *upper_sb;
        int ret;

        if (!ovl_upper_mnt(ofs))
                return 0;

        /*
         * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
         * All the super blocks will be iterated, including upper_sb.
         *
         * If this is a syncfs(2) call, then we do need to call
         * sync_filesystem() on upper_sb, but enough if we do it when being
         * called with wait == 1.
         */
        if (!wait) {
                /* mark inodes on the suspect list dirty if thier
upper inode is dirty */
                ovl_mark_suspect_list_inodes_dirty();
                return 0;
        }
...


The races are avoided because inode is added/removed from suspect
list while overlay inode has a reference (from file) and because upper inode
cannot be dirtied by overlayfs when overlay inode is not on the suspect list.

Unless I am missing something.

Thanks,
Amir.

Amir Goldstein Oct. 15, 2020, 4:06 p.m. UTC | #12

> Very well, I will try to explain with code:
>
> int ovl_inode_is_open_for_write(struct inode *inode)
> {
>        struct inode *upper_inode = ovl_inode_upper(inode);
>
>        return upper_inode && inode_is_open_for_write(upper_inode);
> }
>
> void ovl_maybe_mark_inode_dirty(struct inode *inode)
> {
>        struct inode *upper_inode = ovl_inode_upper(inode);
>
>        if (upper_inode && upper_inode->i_state & I_DIRTY_ALL)
>                 mark_inode_dirty(inode);
> }
>
> int ovl_open(struct inode *inode, struct file *file)
> {
> ...
>        if (ovl_inode_is_open_for_write(file_inode(file)))
>                ovl_add_inode_to_suspect_list(inode);
>
>         file->private_data = realfile;
>
>         return 0;
> }
>
> int ovl_release(struct inode *inode, struct file *file)
> {
>        struct inode *inode = file_inode(file);
>
>        if (ovl_inode_is_open_for_write(inode)) {

Darn! that should have been (!ovl_inode_is_open_for_write(inode))
and it should be done after fput() and grabbing inode reference with
ihold() before fput().

>                ovl_maybe_mark_inode_dirty(inode);
>                ovl_remove_inode_from_suspect_list(inode);
>        }
>
>         fput(file->private_data);
>
>         return 0;
> }
>
> int ovl_drop_inode(struct inode *inode)
> {
>        struct inode *upper_inode = ovl_inode_upper(inode);
>
>        if (!upper_inode)
>                return 1;
>        if (upper_inode->i_state & I_DIRTY_ALL)
>                return 0;
>
>        return !inode_is_open_for_write(upper_inode);
> }
>
> static int ovl_sync_fs(struct super_block *sb, int wait)
> {
>         struct ovl_fs *ofs = sb->s_fs_info;
>         struct super_block *upper_sb;
>         int ret;
>
>         if (!ovl_upper_mnt(ofs))
>                 return 0;
>
>         /*
>          * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
>          * All the super blocks will be iterated, including upper_sb.
>          *
>          * If this is a syncfs(2) call, then we do need to call
>          * sync_filesystem() on upper_sb, but enough if we do it when being
>          * called with wait == 1.
>          */
>         if (!wait) {
>                 /* mark inodes on the suspect list dirty if thier
> upper inode is dirty */
>                 ovl_mark_suspect_list_inodes_dirty();
>                 return 0;
>         }
> ...
>
>
> The races are avoided because inode is added/removed from suspect
> list while overlay inode has a reference (from file) and because upper inode
> cannot be dirtied by overlayfs when overlay inode is not on the suspect list.
>
> Unless I am missing something.
>
> Thanks,
> Amir.

Chengguang Xu Oct. 16, 2020, 1:56 a.m. UTC | #13

---- 在 星期五, 2020-10-16 00:02:42 Amir Goldstein <amir73il@gmail.com> 撰写 ----
 > >  > >  > When an inode is writably mapped via ovarlayfs, you can flag the
 > >  > >  > overlay inode with "maybe-writably-mapped" and then remove
 > >  > >  > it from the maybe dirty list when the underlying inode is not dirty
 > >  > >  > AND its i_writecount is 0 (checked on write_inode() and release()).
 > >  > >  >
 > >  > >  > Actually, there is no reason to treat writably mapped inodes and
 > >  > >  > other dirty inodes differently - insert to suspect list on open for
 > >  > >  > write, remove from suspect list on last release() or write_inode()
 > >  > >  > when inode is no longer dirty and writable.
 > >
 > > I have to say inserting to suspect list cannot prevent dropping,
 > > thinking of the problem of previous approach that we write dirty upper
 > > inode with current->flags & PF_MEMALLOC while evicting clean overlay inode.
 > >
 > 
 > Sorry, I don't understand what that means.

This is the main problem of my previous patch set V10, evicting clean inode
expects no write behavior but in the case of dirty upper inode we have to
write out dirty data in this timing otherwise we will lose the connection with upper inode.


 > 
 > >
 > >  > >  >
 > >  > >  > Did I miss anything?
 > >  > >  >
 > >  > >
 > >  > > If we dirty overlay inode that means we have launched writeback mechanism,
 > >  > > so in this case, re-dirty overlay inode in time becomes important.
 > >  > >
 > >  >
 > >  > My idea was to use the first call to ovl_sync_fs() with 'wait' false
 > >  > to iterate the
 > >  > maybe-dirty list and re-dirty overlay inodes whose upper is dirty.
 > >  >
 > >
 > > I'm curious how we prevent dropping of clean overlay inode with dirty upper?
 > > Hold another reference during iput_final operation? in the drop_inode() or something
 > > else?
 > 
 > No, just return 0 from ovl_drop_inode() and iput_final() will not evict().

It's not good,  it  only temporarily  skips eviction, the inode in lru list
will be evicted in some cases like drop cache or memory reclaim. etc.

A solution for this is getting another reference in ->drop_inode so that
the inode can escape from adding to lru list but this looks awkward and tricky.

 > 
 > >
 > >
 > >  > Then in the second call to __sync_filesystem, sync_inodes_sb() will take
 > >  > care of calling ovl_write_inode() for all the re-dirty inodes.
 > >  >
 > >  > In current code we sync ALL dirty upper fs inodes and we do not overlay
 > >  > inodes with no reference in cache.
 > >  >
 > >  > The best code would sync only upper fs inodes dirtied by this overlay
 > >  > and will be able to evict overlay inodes whose upper inodes are clean.
 > >  >
 > >  > The compromise code would sync only upper fs inodes dirtied by this overlay,
 > >  > and would not evict overlay inodes as long as upper inodes are "open for write".
 > >  > I think its a fine compromise considering the alternatives.
 > >  >
 > >  > Is this workable?
 > >  >
 > >
 > > In your approach, the key point is how to prevent dropping overlay inode that has
 > > dirty upper and no reference but I don't understand well how to achieve it from
 > > your descriptions.
 > >
 > >
 > 
 > Very well, I will try to explain with code:
 > 
 > int ovl_inode_is_open_for_write(struct inode *inode)
 > {
 >        struct inode *upper_inode = ovl_inode_upper(inode);
 > 
 >        return upper_inode && inode_is_open_for_write(upper_inode);
 > }
 > 
 > void ovl_maybe_mark_inode_dirty(struct inode *inode)
 > {
 >        struct inode *upper_inode = ovl_inode_upper(inode);
 > 
 >        if (upper_inode && upper_inode->i_state & I_DIRTY_ALL)
 >                 mark_inode_dirty(inode);
 > }
 > 
 > int ovl_open(struct inode *inode, struct file *file)
 > {
 > ...
 >        if (ovl_inode_is_open_for_write(file_inode(file)))
 >                ovl_add_inode_to_suspect_list(inode);
 > 
 >         file->private_data = realfile;
 > 
 >         return 0;
 > }
 > 
 > int ovl_release(struct inode *inode, struct file *file)
 > {
 >        struct inode *inode = file_inode(file);
 > 
 >        if (ovl_inode_is_open_for_write(inode)) {
 >                ovl_maybe_mark_inode_dirty(inode);
 >                ovl_remove_inode_from_suspect_list(inode);

I think in some cases removing from suspect_list will cause losing
the connection with upper inode that has writable mmap.


 >        }
 > 
 >         fput(file->private_data);
 > 
 >         return 0;
 > }
 > 
 > int ovl_drop_inode(struct inode *inode)
 > {
 >        struct inode *upper_inode = ovl_inode_upper(inode);
 > 
 >        if (!upper_inode)
 >                return 1;
 >        if (upper_inode->i_state & I_DIRTY_ALL)
 >                return 0;
 > 
 >        return !inode_is_open_for_write(upper_inode);

Is this condition just for writable mmap?


 > }
 > 
 > static int ovl_sync_fs(struct super_block *sb, int wait)
 > {
 >         struct ovl_fs *ofs = sb->s_fs_info;
 >         struct super_block *upper_sb;
 >         int ret;
 > 
 >         if (!ovl_upper_mnt(ofs))
 >                 return 0;
 > 
 >         /*
 >          * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
 >          * All the super blocks will be iterated, including upper_sb.
 >          *
 >          * If this is a syncfs(2) call, then we do need to call
 >          * sync_filesystem() on upper_sb, but enough if we do it when being
 >          * called with wait == 1.
 >          */
 >         if (!wait) {
 >                 /* mark inodes on the suspect list dirty if thier
 > upper inode is dirty */
 >                 ovl_mark_suspect_list_inodes_dirty();
 >                 return 0;
 >         }
 > ...
 > 

Why 2 rounds?  it seems the simplest way is only syncing dirty upper inode
on wait==1 just like my previous patch.


 > 
 > The races are avoided because inode is added/removed from suspect
 > list while overlay inode has a reference (from file) and because upper inode
 > cannot be dirtied by overlayfs when overlay inode is not on the suspect list.
 > 
 > Unless I am missing something.
 > 

Thanks,
Chengguang

Amir Goldstein Oct. 16, 2020, 4:39 a.m. UTC | #14

On Fri, Oct 16, 2020 at 4:56 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
>
>  ---- 在 星期五, 2020-10-16 00:02:42 Amir Goldstein <amir73il@gmail.com> 撰写 ----
>  > >  > >  > When an inode is writably mapped via ovarlayfs, you can flag the
>  > >  > >  > overlay inode with "maybe-writably-mapped" and then remove
>  > >  > >  > it from the maybe dirty list when the underlying inode is not dirty
>  > >  > >  > AND its i_writecount is 0 (checked on write_inode() and release()).
>  > >  > >  >
>  > >  > >  > Actually, there is no reason to treat writably mapped inodes and
>  > >  > >  > other dirty inodes differently - insert to suspect list on open for
>  > >  > >  > write, remove from suspect list on last release() or write_inode()
>  > >  > >  > when inode is no longer dirty and writable.
>  > >
>  > > I have to say inserting to suspect list cannot prevent dropping,
>  > > thinking of the problem of previous approach that we write dirty upper
>  > > inode with current->flags & PF_MEMALLOC while evicting clean overlay inode.
>  > >
>  >
>  > Sorry, I don't understand what that means.
>
> This is the main problem of my previous patch set V10, evicting clean inode
> expects no write behavior but in the case of dirty upper inode we have to
> write out dirty data in this timing otherwise we will lose the connection with upper inode.
>

My thinking was that the suspect list holds a reference to the overlay inode.
The question is can we always safely get rid of that reference and remove
from the suspect list when the inode is no longer "writable". Let's see...

>
>  >
>  > >
>  > >  > >  >
>  > >  > >  > Did I miss anything?
>  > >  > >  >
>  > >  > >
>  > >  > > If we dirty overlay inode that means we have launched writeback mechanism,
>  > >  > > so in this case, re-dirty overlay inode in time becomes important.
>  > >  > >
>  > >  >
>  > >  > My idea was to use the first call to ovl_sync_fs() with 'wait' false
>  > >  > to iterate the
>  > >  > maybe-dirty list and re-dirty overlay inodes whose upper is dirty.
>  > >  >
>  > >
>  > > I'm curious how we prevent dropping of clean overlay inode with dirty upper?
>  > > Hold another reference during iput_final operation? in the drop_inode() or something
>  > > else?
>  >
>  > No, just return 0 from ovl_drop_inode() and iput_final() will not evict().
>
> It's not good,  it  only temporarily  skips eviction, the inode in lru list
> will be evicted in some cases like drop cache or memory reclaim. etc.
>
> A solution for this is getting another reference in ->drop_inode so that
> the inode can escape from adding to lru list but this looks awkward and tricky.
>

Right, that was nonsense. We need to rely on the reference held by the
suspect list.

>  >
>  > >
>  > >
>  > >  > Then in the second call to __sync_filesystem, sync_inodes_sb() will take
>  > >  > care of calling ovl_write_inode() for all the re-dirty inodes.
>  > >  >
>  > >  > In current code we sync ALL dirty upper fs inodes and we do not overlay
>  > >  > inodes with no reference in cache.
>  > >  >
>  > >  > The best code would sync only upper fs inodes dirtied by this overlay
>  > >  > and will be able to evict overlay inodes whose upper inodes are clean.
>  > >  >
>  > >  > The compromise code would sync only upper fs inodes dirtied by this overlay,
>  > >  > and would not evict overlay inodes as long as upper inodes are "open for write".
>  > >  > I think its a fine compromise considering the alternatives.
>  > >  >
>  > >  > Is this workable?
>  > >  >
>  > >
>  > > In your approach, the key point is how to prevent dropping overlay inode that has
>  > > dirty upper and no reference but I don't understand well how to achieve it from
>  > > your descriptions.
>  > >
>  > >
>  >
>  > Very well, I will try to explain with code:
>  >
>  > int ovl_inode_is_open_for_write(struct inode *inode)
>  > {
>  >        struct inode *upper_inode = ovl_inode_upper(inode);
>  >
>  >        return upper_inode && inode_is_open_for_write(upper_inode);
>  > }
>  >
>  > void ovl_maybe_mark_inode_dirty(struct inode *inode)
>  > {
>  >        struct inode *upper_inode = ovl_inode_upper(inode);
>  >
>  >        if (upper_inode && upper_inode->i_state & I_DIRTY_ALL)
>  >                 mark_inode_dirty(inode);
>  > }
>  >
>  > int ovl_open(struct inode *inode, struct file *file)
>  > {
>  > ...
>  >        if (ovl_inode_is_open_for_write(file_inode(file)))
>  >                ovl_add_inode_to_suspect_list(inode);
>  >
>  >         file->private_data = realfile;
>  >
>  >         return 0;
>  > }
>  >
>  > int ovl_release(struct inode *inode, struct file *file)
>  > {
>  >        struct inode *inode = file_inode(file);
>  >
>  >        if (ovl_inode_is_open_for_write(inode)) {
>  >                ovl_maybe_mark_inode_dirty(inode);
>  >                ovl_remove_inode_from_suspect_list(inode);
>
> I think in some cases removing from suspect_list will cause losing
> the connection with upper inode that has writable mmap.
>

First of all I had a bug here.
Need to check for !ovl_inode_is_open_for_write(inode) after fput().

If the upper inode has a writable mmap, the upper inode would still
be "writable" (i_writecount held by the map realfile reference).

So when closing the last overlay file reference while upper inode
writable maps still exist, the remaining issue is when to remove
the overlay inode from the suspect list and allow its eviction and
I did not mention that.

I *think* that this condition should be safe in the regard that
after the condition is met, there is no way to dirty the upper inode
via overlayfs without going through ovl_open().
Obviously, the test should be done with some list lock held.

bool ovl_may_remove_from_suspect_list(struct inode *inode)
{
        struct inode *upper_inode = ovl_inode_upper(inode);

        if (upper_inode && upper_inode->i_state & I_DIRTY_ALL)
                return false;

        return !inode_is_open_for_write(upper_inode);
}

Now remains the question of WHEN to check for removal
from the suspect list.

The first place is in ovl_sync_fs() when iterating the suspect list,
inodes that meet the above criteria are "indefinitely clean" and
may be removed from the list at that timing.

For eviction during memory pressure, overlay can register a
shrinker to do this garbage collection. Is shrinker being called
on drop_caches? I'm not sure. But we can also do periodic garbage
collection.

In the end, it is not the common case and we just need the garbage
collection to mitigate DoS (or existing workload) that does many:
- open
- mmap(...PROT_WRITE, MAP_SHARED...)
- close
- munmap


>
>  >        }
>  >
>  >         fput(file->private_data);
>  >
>  >         return 0;
>  > }
>  >
>  > int ovl_drop_inode(struct inode *inode)
>  > {
>  >        struct inode *upper_inode = ovl_inode_upper(inode);
>  >
>  >        if (!upper_inode)
>  >                return 1;
>  >        if (upper_inode->i_state & I_DIRTY_ALL)
>  >                return 0;
>  >
>  >        return !inode_is_open_for_write(upper_inode);
>
> Is this condition just for writable mmap?
>

No, it's for all inodes that may be written from overlayfs (or
from maps created by overalyfs), but as you wrote, this
test is not needed in drop_inode().

>
>  > }
>  >
>  > static int ovl_sync_fs(struct super_block *sb, int wait)
>  > {
>  >         struct ovl_fs *ofs = sb->s_fs_info;
>  >         struct super_block *upper_sb;
>  >         int ret;
>  >
>  >         if (!ovl_upper_mnt(ofs))
>  >                 return 0;
>  >
>  >         /*
>  >          * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
>  >          * All the super blocks will be iterated, including upper_sb.
>  >          *
>  >          * If this is a syncfs(2) call, then we do need to call
>  >          * sync_filesystem() on upper_sb, but enough if we do it when being
>  >          * called with wait == 1.
>  >          */
>  >         if (!wait) {
>  >                 /* mark inodes on the suspect list dirty if thier
>  > upper inode is dirty */
>  >                 ovl_mark_suspect_list_inodes_dirty();
>  >                 return 0;
>  >         }
>  > ...
>  >
>
> Why 2 rounds?  it seems the simplest way is only syncing dirty upper inode
> on wait==1 just like my previous patch.
>

We don't have to do it in 2 rounds, but as long as we have a suspect list,
we can use it to start writeback on wait==0 on all dirty upper inodes from
our list just like the caller intended.

Do we need overlay sbi and mark_inode_dirty() and ovl_write_inode() at all?
I'm not sure. It feels like a good idea to use generic infrastructure as much as
possible. You should know better than me to answer this question.

Thanks,
Amir.

Chengguang Xu Oct. 16, 2020, 7:09 a.m. UTC | #15

---- 在 星期四, 2020-10-15 12:57:41 Al Viro <viro@zeniv.linux.org.uk> 撰写 ----
 > On Thu, Oct 15, 2020 at 11:42:51AM +0800, Chengguang Xu wrote:
 > >  ---- 在 星期四, 2020-10-15 11:25:01 Al Viro <viro@zeniv.linux.org.uk> 撰写 ----
 > >  > On Sat, Oct 10, 2020 at 10:23:51PM +0800, Chengguang Xu wrote:
 > >  > > Currently there is no notification api for kernel about modification
 > >  > > of vfs inode, in some use cases like overlayfs, this kind of notification
 > >  > > will be very helpful to implement containerized syncfs functionality.
 > >  > > As the first attempt, we introduce marking inode dirty notification so that
 > >  > > overlay's inode could mark itself dirty as well and then only sync dirty
 > >  > > overlay inode while syncfs.
 > >  > 
 > >  > Who's responsible for removing the crap from notifier chain?  And how does
 > >  > that affect the lifetime of inode?
 > >  
 > > In this case, overlayfs unregisters call back from the notifier chain of upper inode
 > > when evicting it's own  inode. It will not affect the lifetime of upper inode because
 > > overlayfs inode holds a reference of upper inode that means upper inode will not be
 > > evicted while overlayfs inode is still alive.
 > 
 > Let me see if I've got it right:
 >     * your chain contains 1 (for upper inodes) or 0 (everything else, i.e. the
 > vast majority of inodes) recepients
 >     * recepient pins the inode for as long as the recepient exists
 > 
 > That looks like a massive overkill, especially since all you are propagating is
 > dirtying the suckers.  All you really need is one bit in your inode + hash table
 > indexed by the address of struct inode (well, middle bits thereof, as usual).
 > With entries embedded into overlayfs-private part of overlayfs inode.  And callback
 > to be called stored in that entry...
 > 

Hi AI, Jack, Amir

Based on your feedback, I would to change the inode dirty notification
something like below, is it acceptable? 


diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1492271..48473d9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2249,6 +2249,14 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 
        trace_writeback_mark_inode_dirty(inode, flags);
 
+       if (inode->state & I_OVL_INUSE) {
+               struct inode *ovl_inode;
+
+               ovl_inode = ilookup5(NULL, (unsigned long)inode, ovl_inode_test, inode);
+               if (ovl_inode)
+                       __mark_inode_dirty(ovl_inode, flags);
+       }
+
        /*
         * Don't do this for I_DIRTY_PAGES - that doesn't actually
         * dirty the inode itself
diff --git a/fs/inode.c b/fs/inode.c
index 72c4c34..ed6c85e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -820,7 +820,7 @@ static struct inode *find_inode(struct super_block *sb,
 
 repeat:
        hlist_for_each_entry(inode, head, i_hash) {
-               if (inode->i_sb != sb)
+               if (sb && inode->i_sb != sb)
                        continue;
                if (!test(inode, data))
                        continue;


Thanks,
Chengguang

Chengguang Xu Oct. 16, 2020, 7:43 a.m. UTC | #16

---- 在 星期五, 2020-10-16 12:39:10 Amir Goldstein <amir73il@gmail.com> 撰写 ----
 > On Fri, Oct 16, 2020 at 4:56 AM Chengguang Xu <cgxu519@mykernel.net> wrote:
 > >
 > >  ---- 在 星期五, 2020-10-16 00:02:42 Amir Goldstein <amir73il@gmail.com> 撰写 ----
 > >  > >  > >  > When an inode is writably mapped via ovarlayfs, you can flag the
 > >  > >  > >  > overlay inode with "maybe-writably-mapped" and then remove
 > >  > >  > >  > it from the maybe dirty list when the underlying inode is not dirty
 > >  > >  > >  > AND its i_writecount is 0 (checked on write_inode() and release()).
 > >  > >  > >  >
 > >  > >  > >  > Actually, there is no reason to treat writably mapped inodes and
 > >  > >  > >  > other dirty inodes differently - insert to suspect list on open for
 > >  > >  > >  > write, remove from suspect list on last release() or write_inode()
 > >  > >  > >  > when inode is no longer dirty and writable.
 > >  > >
 > >  > > I have to say inserting to suspect list cannot prevent dropping,
 > >  > > thinking of the problem of previous approach that we write dirty upper
 > >  > > inode with current->flags & PF_MEMALLOC while evicting clean overlay inode.
 > >  > >
 > >  >
 > >  > Sorry, I don't understand what that means.
 > >
 > > This is the main problem of my previous patch set V10, evicting clean inode
 > > expects no write behavior but in the case of dirty upper inode we have to
 > > write out dirty data in this timing otherwise we will lose the connection with upper inode.
 > >
 > 
 > My thinking was that the suspect list holds a reference to the overlay inode.
 > The question is can we always safely get rid of that reference and remove
 > from the suspect list when the inode is no longer "writable". Let's see...
 > 
 > >
 > >  >
 > >  > >
 > >  > >  > >  >
 > >  > >  > >  > Did I miss anything?
 > >  > >  > >  >
 > >  > >  > >
 > >  > >  > > If we dirty overlay inode that means we have launched writeback mechanism,
 > >  > >  > > so in this case, re-dirty overlay inode in time becomes important.
 > >  > >  > >
 > >  > >  >
 > >  > >  > My idea was to use the first call to ovl_sync_fs() with 'wait' false
 > >  > >  > to iterate the
 > >  > >  > maybe-dirty list and re-dirty overlay inodes whose upper is dirty.
 > >  > >  >
 > >  > >
 > >  > > I'm curious how we prevent dropping of clean overlay inode with dirty upper?
 > >  > > Hold another reference during iput_final operation? in the drop_inode() or something
 > >  > > else?
 > >  >
 > >  > No, just return 0 from ovl_drop_inode() and iput_final() will not evict().
 > >
 > > It's not good,  it  only temporarily  skips eviction, the inode in lru list
 > > will be evicted in some cases like drop cache or memory reclaim. etc.
 > >
 > > A solution for this is getting another reference in ->drop_inode so that
 > > the inode can escape from adding to lru list but this looks awkward and tricky.
 > >
 > 
 > Right, that was nonsense. We need to rely on the reference held by the
 > suspect list.
 > 
 > >  >
 > >  > >
 > >  > >
 > >  > >  > Then in the second call to __sync_filesystem, sync_inodes_sb() will take
 > >  > >  > care of calling ovl_write_inode() for all the re-dirty inodes.
 > >  > >  >
 > >  > >  > In current code we sync ALL dirty upper fs inodes and we do not overlay
 > >  > >  > inodes with no reference in cache.
 > >  > >  >
 > >  > >  > The best code would sync only upper fs inodes dirtied by this overlay
 > >  > >  > and will be able to evict overlay inodes whose upper inodes are clean.
 > >  > >  >
 > >  > >  > The compromise code would sync only upper fs inodes dirtied by this overlay,
 > >  > >  > and would not evict overlay inodes as long as upper inodes are "open for write".
 > >  > >  > I think its a fine compromise considering the alternatives.
 > >  > >  >
 > >  > >  > Is this workable?
 > >  > >  >
 > >  > >
 > >  > > In your approach, the key point is how to prevent dropping overlay inode that has
 > >  > > dirty upper and no reference but I don't understand well how to achieve it from
 > >  > > your descriptions.
 > >  > >
 > >  > >
 > >  >
 > >  > Very well, I will try to explain with code:
 > >  >
 > >  > int ovl_inode_is_open_for_write(struct inode *inode)
 > >  > {
 > >  >        struct inode *upper_inode = ovl_inode_upper(inode);
 > >  >
 > >  >        return upper_inode && inode_is_open_for_write(upper_inode);
 > >  > }
 > >  >
 > >  > void ovl_maybe_mark_inode_dirty(struct inode *inode)
 > >  > {
 > >  >        struct inode *upper_inode = ovl_inode_upper(inode);
 > >  >
 > >  >        if (upper_inode && upper_inode->i_state & I_DIRTY_ALL)
 > >  >                 mark_inode_dirty(inode);
 > >  > }
 > >  >
 > >  > int ovl_open(struct inode *inode, struct file *file)
 > >  > {
 > >  > ...
 > >  >        if (ovl_inode_is_open_for_write(file_inode(file)))
 > >  >                ovl_add_inode_to_suspect_list(inode);
 > >  >
 > >  >         file->private_data = realfile;
 > >  >
 > >  >         return 0;
 > >  > }
 > >  >
 > >  > int ovl_release(struct inode *inode, struct file *file)
 > >  > {
 > >  >        struct inode *inode = file_inode(file);
 > >  >
 > >  >        if (ovl_inode_is_open_for_write(inode)) {
 > >  >                ovl_maybe_mark_inode_dirty(inode);
 > >  >                ovl_remove_inode_from_suspect_list(inode);
 > >
 > > I think in some cases removing from suspect_list will cause losing
 > > the connection with upper inode that has writable mmap.
 > >
 > 
 > First of all I had a bug here.
 > Need to check for !ovl_inode_is_open_for_write(inode) after fput().
 > 
 > If the upper inode has a writable mmap, the upper inode would still
 > be "writable" (i_writecount held by the map realfile reference).
 > 
 > So when closing the last overlay file reference while upper inode
 > writable maps still exist, the remaining issue is when to remove
 > the overlay inode from the suspect list and allow its eviction and
 > I did not mention that.
 > 
 > I *think* that this condition should be safe in the regard that
 > after the condition is met, there is no way to dirty the upper inode
 > via overlayfs without going through ovl_open().
 > Obviously, the test should be done with some list lock held.
 > 
 > bool ovl_may_remove_from_suspect_list(struct inode *inode)
 > {
 >         struct inode *upper_inode = ovl_inode_upper(inode);
 > 
 >         if (upper_inode && upper_inode->i_state & I_DIRTY_ALL)
 >                 return false;
 > 
 >         return !inode_is_open_for_write(upper_inode);
 > }
 > 
 > Now remains the question of WHEN to check for removal
 > from the suspect list.
 > 
 > The first place is in ovl_sync_fs() when iterating the suspect list,
 > inodes that meet the above criteria are "indefinitely clean" and
 > may be removed from the list at that timing.
 > 
 > For eviction during memory pressure, overlay can register a
 > shrinker to do this garbage collection. Is shrinker being called
 > on drop_caches? I'm not sure. But we can also do periodic garbage
 > collection.
 > 
 > In the end, it is not the common case and we just need the garbage
 > collection to mitigate DoS (or existing workload) that does many:
 > - open
 > - mmap(...PROT_WRITE, MAP_SHARED...)
 > - close
 > - munmap
 > 

That right, as you mentioned above releasing timing is important and
that also means in worst case, overlayfs hold too much memory resource
until umount. Maybe more proactive shrinker like dedicated workqueue
thread of overlayfs instead of global shrinker will be more helpful.



 > 
 > >
 > >  >        }
 > >  >
 > >  >         fput(file->private_data);
 > >  >
 > >  >         return 0;
 > >  > }
 > >  >
 > >  > int ovl_drop_inode(struct inode *inode)
 > >  > {
 > >  >        struct inode *upper_inode = ovl_inode_upper(inode);
 > >  >
 > >  >        if (!upper_inode)
 > >  >                return 1;
 > >  >        if (upper_inode->i_state & I_DIRTY_ALL)
 > >  >                return 0;
 > >  >
 > >  >        return !inode_is_open_for_write(upper_inode);
 > >
 > > Is this condition just for writable mmap?
 > >
 > 
 > No, it's for all inodes that may be written from overlayfs (or
 > from maps created by overalyfs), but as you wrote, this
 > test is not needed in drop_inode().
 > 
 > >
 > >  > }
 > >  >
 > >  > static int ovl_sync_fs(struct super_block *sb, int wait)
 > >  > {
 > >  >         struct ovl_fs *ofs = sb->s_fs_info;
 > >  >         struct super_block *upper_sb;
 > >  >         int ret;
 > >  >
 > >  >         if (!ovl_upper_mnt(ofs))
 > >  >                 return 0;
 > >  >
 > >  >         /*
 > >  >          * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
 > >  >          * All the super blocks will be iterated, including upper_sb.
 > >  >          *
 > >  >          * If this is a syncfs(2) call, then we do need to call
 > >  >          * sync_filesystem() on upper_sb, but enough if we do it when being
 > >  >          * called with wait == 1.
 > >  >          */
 > >  >         if (!wait) {
 > >  >                 /* mark inodes on the suspect list dirty if thier
 > >  > upper inode is dirty */
 > >  >                 ovl_mark_suspect_list_inodes_dirty();
 > >  >                 return 0;
 > >  >         }
 > >  > ...
 > >  >
 > >
 > > Why 2 rounds?  it seems the simplest way is only syncing dirty upper inode
 > > on wait==1 just like my previous patch.
 > >
 > 
 > We don't have to do it in 2 rounds, but as long as we have a suspect list,
 > we can use it to start writeback on wait==0 on all dirty upper inodes from
 > our list just like the caller intended.
 > 
 > Do we need overlay sbi and mark_inode_dirty() and ovl_write_inode() at all?
 > I'm not sure. It feels like a good idea to use generic infrastructure as much as
 > possible. You should know better than me to answer this question.
 > 

IMO, if we maintain a special list like suspect-list to organize target overlay inodes,
then we don't have to mark overlay inode dirty, marking overlay inode dirty will
bring unnecessary complexity in this case and I think there is no special benefit.

So this approach may achieve containerized syncfs for overlayfs and I've submitted
patch set V12 follow this thinking but I think the complexity is still big obstacle to
merge to upstream from feedback of Miklos. The new approach I posted the other day
is really simple to understand compare to the previous solution, if we can optimize
dirtiness  notification part a bit more then I think it will be more acceptable.

In the end, I need more feedback and  If you guys all think the previous solution  is the right way, 
then I can do more work based on our discussion above and post V13 later.


Thanks,
Chengguang

Jan Kara Oct. 16, 2020, 9:22 a.m. UTC | #17

On Fri 16-10-20 15:09:38, Chengguang Xu wrote:
>  ---- 在 星期四, 2020-10-15 12:57:41 Al Viro <viro@zeniv.linux.org.uk> 撰写 ----
>  > On Thu, Oct 15, 2020 at 11:42:51AM +0800, Chengguang Xu wrote:
>  > >  ---- 在 星期四, 2020-10-15 11:25:01 Al Viro <viro@zeniv.linux.org.uk> 撰写 ----
>  > >  > On Sat, Oct 10, 2020 at 10:23:51PM +0800, Chengguang Xu wrote:
>  > >  > > Currently there is no notification api for kernel about modification
>  > >  > > of vfs inode, in some use cases like overlayfs, this kind of notification
>  > >  > > will be very helpful to implement containerized syncfs functionality.
>  > >  > > As the first attempt, we introduce marking inode dirty notification so that
>  > >  > > overlay's inode could mark itself dirty as well and then only sync dirty
>  > >  > > overlay inode while syncfs.
>  > >  > 
>  > >  > Who's responsible for removing the crap from notifier chain?  And how does
>  > >  > that affect the lifetime of inode?
>  > >  
>  > > In this case, overlayfs unregisters call back from the notifier chain of upper inode
>  > > when evicting it's own  inode. It will not affect the lifetime of upper inode because
>  > > overlayfs inode holds a reference of upper inode that means upper inode will not be
>  > > evicted while overlayfs inode is still alive.
>  > 
>  > Let me see if I've got it right:
>  >     * your chain contains 1 (for upper inodes) or 0 (everything else, i.e. the
>  > vast majority of inodes) recepients
>  >     * recepient pins the inode for as long as the recepient exists
>  > 
>  > That looks like a massive overkill, especially since all you are propagating is
>  > dirtying the suckers.  All you really need is one bit in your inode + hash table
>  > indexed by the address of struct inode (well, middle bits thereof, as usual).
>  > With entries embedded into overlayfs-private part of overlayfs inode.  And callback
>  > to be called stored in that entry...
>  > 
> 
> Hi AI, Jack, Amir
> 
> Based on your feedback, I would to change the inode dirty notification
> something like below, is it acceptable? 
> 
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 1492271..48473d9 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -2249,6 +2249,14 @@ void __mark_inode_dirty(struct inode *inode, int flags)
>  
>         trace_writeback_mark_inode_dirty(inode, flags);
>  
> +       if (inode->state & I_OVL_INUSE) {
> +               struct inode *ovl_inode;
> +
> +               ovl_inode = ilookup5(NULL, (unsigned long)inode, ovl_inode_test, inode);

I don't think this will work - superblock pointer is part of the hash value
inode is hashed with so without proper sb pointer you won't find proper
hash chain.

								Honza

[RFC,1/5] fs: introduce notifier list for vfs inode

Commit Message

Comments

Patch