[v5,0/5] fuse,virtiofs: support per-file DAX

Message ID 20210923092526.72341-1-jefflexu@linux.alibaba.com (mailing list archive)
Message

Jingbo Xu Sept. 23, 2021, 9:25 a.m. UTC
This patchset adds per-file DAX support for virtiofs, inspired by
Ira Weiny's work on ext4 [1] and xfs [2].

Any comments are welcome.

[1] commit 9cb20f94afcd ("fs/ext4: Make DAX mount option a tri-state")
[2] commit 02beb2686ff9 ("fs/xfs: Make DAX mount option a tri-state")


[Purpose]
DAX may be limited in some situations. When the number of usable DAX
windows falls below the watermark, the reclaim routine is triggered to
reclaim some DAX windows. This may hurt performance, since some
processes may then have to wait for DAX windows to be reclaimed and
reused. To mitigate the performance degradation, the overall DAX window
needs to be enlarged.

However, simply enlarging the DAX window may not be a good deal in some
scenarios. To maintain one DAX window chunk (i.e., 2MB in size), 32KB
(512 * 64 bytes) of memory is consumed for page descriptors inside the
guest, which for a small file is greater than the memory footprint of
using the guest page cache with DAX disabled. Thus it is better to
disable DAX for files smaller than 32KB, to reduce the demand for DAX
windows and avoid the unjustified memory overhead.
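
The arithmetic above can be sketched as follows. This is only an
illustration: the 4KB guest page size is inferred from 2MB / 512, the
64-byte descriptor size is the figure quoted above, and all names are
hypothetical. It considers only a file that fits in a single chunk.

```c
/* Assumed values: 4KB guest pages, 64-byte page descriptors. */
#define MODEL_PAGE_BYTES      4096UL
#define MODEL_PAGE_DESC_BYTES 64UL
#define MODEL_DAX_CHUNK_BYTES (2UL * 1024 * 1024)

/* Descriptor memory needed in the guest to map one DAX window chunk. */
static unsigned long dax_chunk_overhead(void)
{
	return (MODEL_DAX_CHUNK_BYTES / MODEL_PAGE_BYTES) * MODEL_PAGE_DESC_BYTES;
}

/* Page cache memory for a file, rounded up to whole pages. */
static unsigned long page_cache_footprint(unsigned long file_size)
{
	unsigned long pages = (file_size + MODEL_PAGE_BYTES - 1) / MODEL_PAGE_BYTES;

	return pages * MODEL_PAGE_BYTES;
}

/* DAX pays off only once the page cache cost exceeds the chunk overhead. */
static int dax_worthwhile(unsigned long file_size)
{
	return page_cache_footprint(file_size) > dax_chunk_overhead();
}
```

Under these assumptions a 4KB file costs 4KB in the page cache but 32KB
of descriptors under DAX, so dax_worthwhile() returns 0 for it.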

The per-file DAX feature is introduced to address this issue by offering
users finer-grained control over DAX, trying to strike a balance
between performance and memory overhead.


[Note]
When the per-file DAX hint changes while the file is still *open*,
dynamically changing the DAX state is complicated and potentially
fragile, since dynamic switching needs to switch a_ops atomically. Ira
Weiny once implemented a so-called i_aops_sem lock [3] but eventually
gave it up because of the complexity of the implementation [4][5][6][7].

Hence mark the inode and corresponding dentries as DONT_CACHE once the
per-file DAX hint changes, so that the inode instance will be evicted
and freed as soon as possible once the file is closed and the last
reference to the inode is put. Then, when the file is next reopened,
the newly instantiated inode will reflect the new DAX state.

In summary, when the per-file DAX hint changes for an *open* file, the
DAX state of the file won't be updated until the file is closed and
reopened. This is also how ext4/xfs per-file DAX works.
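
The lifecycle described in this note can be modelled in userspace as
below. This is a rough sketch, not the code from the patches: the
struct, its fields and the model_* functions are all hypothetical, and
real inode eviction is far more involved.

```c
#include <stdbool.h>

/* Hypothetical userspace model; this is not the kernel's struct inode. */
struct model_inode {
	bool cached;      /* an in-memory instance currently exists */
	bool dax;         /* DAX state, fixed when the instance is created */
	bool dont_cache;  /* evict the instance on the last close */
	int  open_count;
};

/* Instantiate the inode if needed; the DAX state is sampled only here. */
static void model_open(struct model_inode *in, bool server_hint)
{
	if (!in->cached) {
		in->cached = true;
		in->dax = server_hint;
		in->dont_cache = false;
	}
	in->open_count++;
}

/* A changed hint does not flip the DAX state; it only marks DONT_CACHE. */
static void model_update_hint(struct model_inode *in, bool server_hint)
{
	if (in->cached && in->dax != server_hint)
		in->dont_cache = true;
}

/* On the last close, a DONT_CACHE inode is dropped instead of kept cached. */
static void model_close(struct model_inode *in)
{
	if (--in->open_count == 0 && in->dont_cache)
		in->cached = false;
}
```

In this model, a hint change on an open file leaves in->dax untouched;
only the next model_open() after the final model_close() picks up the
new state.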

[3] https://lore.kernel.org/lkml/20200227052442.22524-7-ira.weiny@intel.com/
[4] https://patchwork.kernel.org/project/xfs/cover/20200407182958.568475-1-ira.weiny@intel.com/
[5] https://lore.kernel.org/lkml/20200305155144.GA5598@lst.de/
[6] https://lore.kernel.org/lkml/20200401040021.GC56958@magnolia/
[7] https://lore.kernel.org/lkml/20200403182904.GP80283@magnolia/

changes since v4:
- drop support for setting/clearing FS_DAX inside guest
- and thus drop the negotiation phase during FUSE_INIT

changes since v3:
- bug fix (patch 6): s/"IS_DAX(inode) != newdax"/"!!IS_DAX(inode) !=
  newdax"
- during FUSE_INIT, advertise capability for per-file DAX only when
  mounted as "-o dax=inode" (patch 4)
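
The "!!" fix listed above matters when a flag test is compared against
a boolean. The sketch below assumes IS_DAX() expands to a bitmask test
that yields either 0 or the flag bit; the MODEL_* names and the flag
value are hypothetical stand-ins chosen for illustration.

```c
#include <stdbool.h>

/* Hypothetical flag bit; any value other than 1 shows the problem. */
#define MODEL_S_DAX 0x2000u

struct model_flags {
	unsigned int i_flags;
};

/* Yields 0 or MODEL_S_DAX, not 0 or 1. */
#define MODEL_IS_DAX(in) ((in)->i_flags & MODEL_S_DAX)

/* Buggy: compares the raw flag bit (0x2000) against a bool (0 or 1). */
static bool dax_changed_buggy(const struct model_flags *in, bool newdax)
{
	return MODEL_IS_DAX(in) != newdax;
}

/* Fixed: "!!" normalizes the flag test to 0 or 1 before comparing. */
static bool dax_changed_fixed(const struct model_flags *in, bool newdax)
{
	return !!MODEL_IS_DAX(in) != newdax;
}
```

With the flag set and newdax true, the buggy variant reports a change
even though nothing changed, while the fixed variant does not.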

changes since v2:
- modify fuse_show_options() accordingly to make it compatible with
  new tri-state mount option (patch 2)
- extract FUSE protocol changes into one separate patch (patch 3)
- FUSE server/client need to negotiate if they support per-file DAX
  (patch 4)
- extract DONT_CACHE logic into patch 6/7
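
For orientation, a tri-state DAX mount option in the ext4/xfs style
(dax=always, dax=never, dax=inode) might be evaluated roughly as below.
This is a sketch under assumptions, not the fuse_should_enable_dax()
helper from this series: the enum, the function and its parameters are
hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical names; the patches may spell these differently. */
enum model_dax_mode {
	MODEL_DAX_ALWAYS,  /* -o dax=always: DAX for every file */
	MODEL_DAX_NEVER,   /* -o dax=never: DAX for no file */
	MODEL_DAX_INODE,   /* -o dax=inode: follow the per-file hint */
};

/* Decide the DAX state for a newly instantiated inode. */
static bool model_should_enable_dax(enum model_dax_mode mode,
				    bool dax_device_present,
				    bool per_file_hint)
{
	/* Without a DAX window there is nothing to map, whatever the mode. */
	if (!dax_device_present)
		return false;

	switch (mode) {
	case MODEL_DAX_ALWAYS:
		return true;
	case MODEL_DAX_NEVER:
		return false;
	case MODEL_DAX_INODE:
		return per_file_hint;
	}
	return false;
}
```

Only the dax=inode mode consults the per-file hint; the other two modes
ignore it.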

v4: https://lore.kernel.org/linux-fsdevel/20210817022220.17574-1-jefflexu@linux.alibaba.com/
v3: https://www.spinics.net/lists/linux-fsdevel/msg200852.html
v2: https://www.spinics.net/lists/linux-fsdevel/msg199584.html
v1: https://www.spinics.net/lists/linux-virtualization/msg51008.html


Jeffle Xu (5):
  fuse: add fuse_should_enable_dax() helper
  fuse: make DAX mount option a tri-state
  fuse: support per-file DAX
  fuse: enable per-file DAX
  fuse: mark inode DONT_CACHE when per-file DAX hint changes

 fs/fuse/dax.c             | 32 +++++++++++++++++++++++++++++---
 fs/fuse/file.c            |  4 ++--
 fs/fuse/fuse_i.h          | 19 +++++++++++++++----
 fs/fuse/inode.c           | 15 +++++++++++----
 fs/fuse/virtio_fs.c       | 16 ++++++++++++++--
 include/uapi/linux/fuse.h |  7 ++++++-
 6 files changed, 77 insertions(+), 16 deletions(-)

Comments

Vivek Goyal Sept. 23, 2021, 6:57 p.m. UTC | #1
On Thu, Sep 23, 2021 at 05:25:21PM +0800, Jeffle Xu wrote:
> Hence mark the inode and corresponding dentries as DONT_CACHE once the
> per-file DAX hint changes, so that the inode instance will be evicted
> and freed as soon as possible once the file is closed and the last
> reference to the inode is put. Then, when the file is next reopened,
> the newly instantiated inode will reflect the new DAX state.

If we don't cache the inode (when no fd is open), won't that have a
negative performance impact? When we cache inodes, we also have all the
dax mappings cached as well. So if a process opens the same file again,
it gets all the mappings already in place and does not have to call
FUSE_SETUPMAPPING again.

Vivek

Jingbo Xu Oct. 27, 2021, 3:42 a.m. UTC | #2
Sorry for the late reply; your previous reply was moved to the junk box
by the filtering algorithm...

On 9/24/21 2:57 AM, Vivek Goyal wrote:
> If we don't cache the inode (when no fd is open), won't that have a
> negative performance impact? When we cache inodes, we also have all the
> dax mappings cached as well. So if a process opens the same file again,
> it gets all the mappings already in place and does not have to call
> FUSE_SETUPMAPPING again.
> 

What does 'all the dax mappings cached' mean when 'we cache inodes'?

If the per-file DAX hint does change for a large file that already has
many page cache pages or DAX mappings in its address space, then
marking it DONT_CACHE means evicting the inode as soon as possible,
which means flushing the page cache or removing all DAX mappings. When
the inode is next reopened, the page cache is re-instantiated or
FUSE_SETUPMAPPING is called again. So the negative performance impact
does exist in this case.

But this performance impact only exists when the per-file DAX hint
changes midway, that is, when the hint changes after the virtiofs
has already been mounted in the guest.
Vivek Goyal Oct. 27, 2021, 2:45 p.m. UTC | #3
On Wed, Oct 27, 2021 at 11:42:52AM +0800, JeffleXu wrote:
> 
> Sorry for the late reply; your previous reply was moved to the junk box
> by the filtering algorithm...
> 
> On 9/24/21 2:57 AM, Vivek Goyal wrote:
> > If we don't cache the inode (when no fd is open), won't that have a
> > negative performance impact? When we cache inodes, we also have all the
> > dax mappings cached as well. So if a process opens the same file again,
> > it gets all the mappings already in place and does not have to call
> > FUSE_SETUPMAPPING again.
> > 
> 
> What does 'all the dax mappings cached' mean when 'we cache inodes'?
> 
> If the per-file DAX hint does change for a large file that already has
> many page cache pages or DAX mappings in its address space, then
> marking it DONT_CACHE means evicting the inode as soon as possible,
> which means flushing the page cache or removing all DAX mappings. When
> the inode is next reopened, the page cache is re-instantiated or
> FUSE_SETUPMAPPING is called again. So the negative performance impact
> does exist in this case.
> 
> But this performance impact only exists when the per-file DAX hint
> changes midway, that is, when the hint changes after the virtiofs
> has already been mounted in the guest.

Ok, got it. I think I saw that in the code. I had assumed that an inode
would always be marked don't cache. That's not the case. It will be
marked don't cache only if the inode's DAX property changes (from dax
to non-dax or vice versa). That seems fine.

Vivek