Message ID | 20211011030052.98923-1-jefflexu@linux.alibaba.com (mailing list archive) |
---|---|
Headers | show |
Series | fuse,virtiofs: support per-file DAX | expand |
Hi, any comment? On 10/11/21 11:00 AM, Jeffle Xu wrote: > changes since v5: > Overall Design Changes: > 1. virtiofsd now supports ioctl (only FS_IOC_SETFLAGS and > FS_IOC_FSSETXATTR), so that users inside guest could set/clear > persistent inode flags now. (FUSE kernel module has already supported > .ioctl(), virtiofsd need to suuport it.) > 2. When FUSE client is mounted with '-o dax=inode', it indicates that > whether DAX shall be enabled or not for one specific file is > completely determined by FUSE server while FUSE client has no say on > it, and the decision whether DAX shall be enabled or not for specific > file is communicated through FUSE_ATTR_DAX flag of FUSE protocol. The > algorithm used by virtiofsd to determine whether DAX shall be enabled > or not is totally implementation specific, and thus the following > scenario may exist: users inside guest has already set related persistent > inode flag (i.e. FS_XFLAG_DAX) on corresponding file but FUSE server finnaly > decides not to enable DAX for this file. This slight semantic difference > is documented in patch 7. Also because of this, d_mark_dontcache() is > not called when FS_IOC_SETFLAGS/FS_IOC_FSSETXATTR ioctl is done inside > guest. It's delayed to be done if the FUSE_ATTR_DAX flag **indeed** > changes (as showed in patch 6). > 3. patch 1: slightly modify logic of fuse_should_enable_dax() > 4. patch 4: add back negotiation during FUSE_INIT. FUSE client shall > advertise to FUSE server that it's in per-file DAX mode, so that FUSE > server may omit querying persistent inode flags on host if FUSE client > is not mounted in per-file DAX mode, giving querying persistent inode > flags could be quite expensive. > > > chanegs since v4: > - drop support for setting/clearing FS_DAX inside guest > - and thus drop the negotiation phase during FUSE_INIT > > This patchset adds support of per-file DAX for virtiofs, which is > inspired by Ira Weiny's work on ext4[1] and xfs[2]. > > Any comment is welcome. > > [1] commit 9cb20f94afcd ("fs/ext4: Make DAX mount option a tri-state") > [2] commit 02beb2686ff9 ("fs/xfs: Make DAX mount option a tri-state") > > [Purpose] > DAX may be limited in some specific situation. When the number of usable > DAX windows is under watermark, the recalim routine will be triggered to > reclaim some DAX windows. It may have a negative impact on the > performance, since some processes may need to wait for DAX windows to be > recalimed and reused then. To mitigate the performance degradation, the > overall DAX window need to be expanded larger. > > However, simply expanding the DAX window may not be a good deal in some > scenario. To maintain one DAX window chunk (i.e., 2MB in size), 32KB > (512 * 64 bytes) memory footprint will be consumed for page descriptors > inside guest, which is greater than the memory footprint if it uses > guest page cache when DAX disabled. Thus it'd better disable DAX for > those files smaller than 32KB, to reduce the demand for DAX window and > thus avoid the unworthy memory overhead. > > Per-file DAX feature is introduced to address this issue, by offering a > finer grained control for dax to users, trying to achieve a balance > between performance and memory overhead. > > > [Note] > When the per-file DAX hint changes while the file is still *opened*, it > is quite complicated and maybe fragile to dynamically change the DAX > state, since dynamic switching needs to switch a_ops atomiclly. Ira > Weiny had ever implemented a so called i_aops_sem lock [3] but > eventually gave up since the complexity of the implementation > [4][5][6][7]. > > Hence mark the inode and corresponding dentries as DONE_CACHE once the > per-file DAX hint changes, so that the inode instance will be evicted > and freed as soon as possible once the file is closed and the last > reference to the inode is put. And then when the file gets reopened next > time, the new instantiated inode will reflect the new DAX state. > > In summary, when the per-file DAX hint changes for an *opened* file, the > DAX state of the file won't be updated until this file is closed and > reopened later. This is also how ext4/xfs per-file DAX works. > > [3] https://lore.kernel.org/lkml/20200227052442.22524-7-ira.weiny@intel.com/ > [4] https://patchwork.kernel.org/project/xfs/cover/20200407182958.568475-1-ira.weiny@intel.com/ > [5] https://lore.kernel.org/lkml/20200305155144.GA5598@lst.de/ > [6] https://lore.kernel.org/lkml/20200401040021.GC56958@magnolia/ > [7] https://lore.kernel.org/lkml/20200403182904.GP80283@magnolia/ > > changes since v3: > - bug fix (patch 6): s/"IS_DAX(inode) != newdax"/"!!IS_DAX(inode) != > newdax" > - during FUSE_INIT, advertise capability for per-file DAX only when > mounted as "-o dax=inode" (patch 4) > > changes since v2: > - modify fuse_show_options() accordingly to make it compatible with > new tri-state mount option (patch 2) > - extract FUSE protocol changes into one separate patch (patch 3) > - FUSE server/client need to negotiate if they support per-file DAX > (patch 4) > - extract DONT_CACHE logic into patch 6/7 > > v5: https://lore.kernel.org/all/20210923092526.72341-1-jefflexu@linux.alibaba.com/ > v4: https://lore.kernel.org/linux-fsdevel/20210817022220.17574-1-jefflexu@linux.alibaba.com/ > v3: https://www.spinics.net/lists/linux-fsdevel/msg200852.html > v2: https://www.spinics.net/lists/linux-fsdevel/msg199584.html > v1: https://www.spinics.net/lists/linux-virtualization/msg51008.html > > > Jeffle Xu (7): > fuse: add fuse_should_enable_dax() helper > fuse: make DAX mount option a tri-state > fuse: support per-file DAX in fuse protocol > fuse: negotiate per-file DAX in FUSE_INIT > fuse: enable per-file DAX > fuse: mark inode DONT_CACHE when per-file DAX hint changes > Documentation/filesystem/dax: record DAX on virtiofs > > Documentation/filesystems/dax.rst | 20 +++++++++++++++-- > fs/fuse/dax.c | 36 ++++++++++++++++++++++++++++--- > fs/fuse/file.c | 4 ++-- > fs/fuse/fuse_i.h | 19 ++++++++++++---- > fs/fuse/inode.c | 17 +++++++++++---- > fs/fuse/virtio_fs.c | 16 ++++++++++++-- > include/uapi/linux/fuse.h | 9 +++++++- > 7 files changed, 103 insertions(+), 18 deletions(-) >
On Mon, Oct 11, 2021 at 11:00:45AM +0800, Jeffle Xu wrote: > changes since v5: > Overall Design Changes: > 1. virtiofsd now supports ioctl (only FS_IOC_SETFLAGS and > FS_IOC_FSSETXATTR), so that users inside guest could set/clear > persistent inode flags now. (FUSE kernel module has already supported > .ioctl(), virtiofsd need to suuport it.) So no changes needed in fuse side (kernel) to support FS_IOC_FSSETXATTR? Only virtiofsd needs to be changed. That sounds good. Vivek > 2. When FUSE client is mounted with '-o dax=inode', it indicates that > whether DAX shall be enabled or not for one specific file is > completely determined by FUSE server while FUSE client has no say on > it, and the decision whether DAX shall be enabled or not for specific > file is communicated through FUSE_ATTR_DAX flag of FUSE protocol. The > algorithm used by virtiofsd to determine whether DAX shall be enabled > or not is totally implementation specific, and thus the following > scenario may exist: users inside guest has already set related persistent > inode flag (i.e. FS_XFLAG_DAX) on corresponding file but FUSE server finnaly > decides not to enable DAX for this file. This slight semantic difference > is documented in patch 7. Also because of this, d_mark_dontcache() is > not called when FS_IOC_SETFLAGS/FS_IOC_FSSETXATTR ioctl is done inside > guest. It's delayed to be done if the FUSE_ATTR_DAX flag **indeed** > changes (as showed in patch 6). > 3. patch 1: slightly modify logic of fuse_should_enable_dax() > 4. patch 4: add back negotiation during FUSE_INIT. FUSE client shall > advertise to FUSE server that it's in per-file DAX mode, so that FUSE > server may omit querying persistent inode flags on host if FUSE client > is not mounted in per-file DAX mode, giving querying persistent inode > flags could be quite expensive. > > > chanegs since v4: > - drop support for setting/clearing FS_DAX inside guest > - and thus drop the negotiation phase during FUSE_INIT > > This patchset adds support of per-file DAX for virtiofs, which is > inspired by Ira Weiny's work on ext4[1] and xfs[2]. > > Any comment is welcome. > > [1] commit 9cb20f94afcd ("fs/ext4: Make DAX mount option a tri-state") > [2] commit 02beb2686ff9 ("fs/xfs: Make DAX mount option a tri-state") > > [Purpose] > DAX may be limited in some specific situation. When the number of usable > DAX windows is under watermark, the recalim routine will be triggered to > reclaim some DAX windows. It may have a negative impact on the > performance, since some processes may need to wait for DAX windows to be > recalimed and reused then. To mitigate the performance degradation, the > overall DAX window need to be expanded larger. > > However, simply expanding the DAX window may not be a good deal in some > scenario. To maintain one DAX window chunk (i.e., 2MB in size), 32KB > (512 * 64 bytes) memory footprint will be consumed for page descriptors > inside guest, which is greater than the memory footprint if it uses > guest page cache when DAX disabled. Thus it'd better disable DAX for > those files smaller than 32KB, to reduce the demand for DAX window and > thus avoid the unworthy memory overhead. > > Per-file DAX feature is introduced to address this issue, by offering a > finer grained control for dax to users, trying to achieve a balance > between performance and memory overhead. > > > [Note] > When the per-file DAX hint changes while the file is still *opened*, it > is quite complicated and maybe fragile to dynamically change the DAX > state, since dynamic switching needs to switch a_ops atomiclly. Ira > Weiny had ever implemented a so called i_aops_sem lock [3] but > eventually gave up since the complexity of the implementation > [4][5][6][7]. > > Hence mark the inode and corresponding dentries as DONE_CACHE once the > per-file DAX hint changes, so that the inode instance will be evicted > and freed as soon as possible once the file is closed and the last > reference to the inode is put. And then when the file gets reopened next > time, the new instantiated inode will reflect the new DAX state. > > In summary, when the per-file DAX hint changes for an *opened* file, the > DAX state of the file won't be updated until this file is closed and > reopened later. This is also how ext4/xfs per-file DAX works. > > [3] https://lore.kernel.org/lkml/20200227052442.22524-7-ira.weiny@intel.com/ > [4] https://patchwork.kernel.org/project/xfs/cover/20200407182958.568475-1-ira.weiny@intel.com/ > [5] https://lore.kernel.org/lkml/20200305155144.GA5598@lst.de/ > [6] https://lore.kernel.org/lkml/20200401040021.GC56958@magnolia/ > [7] https://lore.kernel.org/lkml/20200403182904.GP80283@magnolia/ > > changes since v3: > - bug fix (patch 6): s/"IS_DAX(inode) != newdax"/"!!IS_DAX(inode) != > newdax" > - during FUSE_INIT, advertise capability for per-file DAX only when > mounted as "-o dax=inode" (patch 4) > > changes since v2: > - modify fuse_show_options() accordingly to make it compatible with > new tri-state mount option (patch 2) > - extract FUSE protocol changes into one separate patch (patch 3) > - FUSE server/client need to negotiate if they support per-file DAX > (patch 4) > - extract DONT_CACHE logic into patch 6/7 > > v5: https://lore.kernel.org/all/20210923092526.72341-1-jefflexu@linux.alibaba.com/ > v4: https://lore.kernel.org/linux-fsdevel/20210817022220.17574-1-jefflexu@linux.alibaba.com/ > v3: https://www.spinics.net/lists/linux-fsdevel/msg200852.html > v2: https://www.spinics.net/lists/linux-fsdevel/msg199584.html > v1: https://www.spinics.net/lists/linux-virtualization/msg51008.html > > > Jeffle Xu (7): > fuse: add fuse_should_enable_dax() helper > fuse: make DAX mount option a tri-state > fuse: support per-file DAX in fuse protocol > fuse: negotiate per-file DAX in FUSE_INIT > fuse: enable per-file DAX > fuse: mark inode DONT_CACHE when per-file DAX hint changes > Documentation/filesystem/dax: record DAX on virtiofs > > Documentation/filesystems/dax.rst | 20 +++++++++++++++-- > fs/fuse/dax.c | 36 ++++++++++++++++++++++++++++--- > fs/fuse/file.c | 4 ++-- > fs/fuse/fuse_i.h | 19 ++++++++++++---- > fs/fuse/inode.c | 17 +++++++++++---- > fs/fuse/virtio_fs.c | 16 ++++++++++++-- > include/uapi/linux/fuse.h | 9 +++++++- > 7 files changed, 103 insertions(+), 18 deletions(-) > > -- > 2.27.0 >
On 10/18/21 11:21 PM, Vivek Goyal wrote: > On Mon, Oct 11, 2021 at 11:00:45AM +0800, Jeffle Xu wrote: >> changes since v5: >> Overall Design Changes: >> 1. virtiofsd now supports ioctl (only FS_IOC_SETFLAGS and >> FS_IOC_FSSETXATTR), so that users inside guest could set/clear >> persistent inode flags now. (FUSE kernel module has already supported >> .ioctl(), virtiofsd need to suuport it.) > > So no changes needed in fuse side (kernel) to support FS_IOC_FSSETXATTR? > Only virtiofsd needs to be changed. That sounds good. > Yes, the fuse kernel modules has already supported FUSE_IOCTL. Per inode DAX on ext4/xfs will also call d_mark_dontcache() and try to evict this inode as soon as possible when the persistent (DAX) inode attribute has changed, just like [1]. But because of following reason: > >> 2. The >> algorithm used by virtiofsd to determine whether DAX shall be enabled >> or not is totally implementation specific, and thus the following >> scenario may exist: users inside guest has already set related persistent >> inode flag (i.e. FS_XFLAG_DAX) on corresponding file but FUSE server finnaly >> decides not to enable DAX for this file. If we always call d_mark_dontcache() and try to evict this inode when the persistent (DAX) inode attribute has changed, the DAX state returned by virtiofsd may sustain the same, and thus the previous eviction is totally wasted and unnecessary. So, as the following said, >> Also because of this, d_mark_dontcache() is >> not called when FS_IOC_SETFLAGS/FS_IOC_FSSETXATTR ioctl is done inside >> guest. It's delayed to be done if the FUSE_ATTR_DAX flag **indeed** >> changes (as showed in patch 6). the call for d_mark_dontcache() and inode eviction is delayed when the DAX state returned by virtiofsd **indeed** changed (when dentry is timed out and a new FUSE_LOOKUP is requested). But the defect is that, if '-o cache=always' is set for virtiofsd, then the DAX state won't be updated for a long time, after users have changed the persistent (DAX) inode attribute inside guest via FS_IOC_FSSETXATTR ioctl. [1] https://www.spinics.net/lists/linux-fsdevel/msg200851.html
On Wed, Oct 20, 2021 at 01:22:32PM +0800, JeffleXu wrote: > > > On 10/18/21 11:21 PM, Vivek Goyal wrote: > > On Mon, Oct 11, 2021 at 11:00:45AM +0800, Jeffle Xu wrote: > >> changes since v5: > >> Overall Design Changes: > >> 1. virtiofsd now supports ioctl (only FS_IOC_SETFLAGS and > >> FS_IOC_FSSETXATTR), so that users inside guest could set/clear > >> persistent inode flags now. (FUSE kernel module has already supported > >> .ioctl(), virtiofsd need to suuport it.) > > > > So no changes needed in fuse side (kernel) to support FS_IOC_FSSETXATTR? > > Only virtiofsd needs to be changed. That sounds good. > > > > Yes, the fuse kernel modules has already supported FUSE_IOCTL. > > Per inode DAX on ext4/xfs will also call d_mark_dontcache() and try to > evict this inode as soon as possible when the persistent (DAX) inode > attribute has changed, just like [1]. > > But because of following reason: > > > >> 2. The > >> algorithm used by virtiofsd to determine whether DAX shall be enabled > >> or not is totally implementation specific, and thus the following > >> scenario may exist: users inside guest has already set related persistent > >> inode flag (i.e. FS_XFLAG_DAX) on corresponding file but FUSE server finnaly > >> decides not to enable DAX for this file. > > If we always call d_mark_dontcache() and try to evict this inode when > the persistent (DAX) inode attribute has changed, the DAX state returned > by virtiofsd may sustain the same, and thus the previous eviction is > totally wasted and unnecessary. > > So, as the following said, > > >> Also because of this, d_mark_dontcache() is > >> not called when FS_IOC_SETFLAGS/FS_IOC_FSSETXATTR ioctl is done inside > >> guest. It's delayed to be done if the FUSE_ATTR_DAX flag **indeed** > >> changes (as showed in patch 6). > > the call for d_mark_dontcache() and inode eviction is delayed when the > DAX state returned by virtiofsd **indeed** changed (when dentry is timed > out and a new FUSE_LOOKUP is requested). But the defect is that, if '-o > cache=always' is set for virtiofsd, then the DAX state won't be updated > for a long time, after users have changed the persistent (DAX) inode > attribute inside guest via FS_IOC_FSSETXATTR ioctl. Good point. I guess this probably is not too bad. If it becomes a concern, we can always mark inode don't cache whenever client changes persistent DAX flag. Vivek > > > > [1] https://www.spinics.net/lists/linux-fsdevel/msg200851.html > > -- > Thanks, > Jeffle >