
[v4,0/8] fuse,virtiofs: support per-file DAX

Message ID 20210817022220.17574-1-jefflexu@linux.alibaba.com (mailing list archive)

Message

Jingbo Xu Aug. 17, 2021, 2:22 a.m. UTC
This patchset adds support for per-file DAX in virtiofs, which is
inspired by Ira Weiny's work on ext4[1] and xfs[2].

Any comments are welcome.

[1] commit 9cb20f94afcd ("fs/ext4: Make DAX mount option a tri-state")
[2] commit 02beb2686ff9 ("fs/xfs: Make DAX mount option a tri-state")


changes since v3:
- bug fix (patch 6): s/"IS_DAX(inode) != newdax"/"!!IS_DAX(inode) != newdax"
- during FUSE_INIT, advertise capability for per-file DAX only when
  mounted as "-o dax=inode" (patch 4)
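
The patch 6 fix listed above can be illustrated with a small Python model.
This is only a sketch: it assumes IS_DAX() is a bitwise AND against a single
flag bit (so it yields 0 or the bit value, never 1) and that newdax is a 0/1
boolean. The S_DAX value and the helper names are illustrative assumptions,
not taken from the patch itself.

```python
# Assumed flag value; the point is only that it is a single bit other than 1.
S_DAX = 0x2000


def is_dax(i_flags: int) -> int:
    """Model of a C macro of the form (i_flags & S_DAX): returns raw bits."""
    return i_flags & S_DAX


def dax_changed_v3(i_flags: int, newdax: bool) -> bool:
    # v3-style comparison: raw masked bits against a 0/1 value.
    return is_dax(i_flags) != newdax


def dax_changed_v4(i_flags: int, newdax: bool) -> bool:
    # v4-style comparison: bool() plays the role of C's "!!" normalization.
    return bool(is_dax(i_flags)) != newdax


flags = S_DAX | 0x1  # DAX already set, plus one unrelated flag bit
print(dax_changed_v3(flags, True))  # True: a change is reported wrongly
print(dax_changed_v4(flags, True))  # False: no change, as expected
```

Without the normalization, an inode that already has DAX enabled would look
like its DAX state had changed every time it is compared with newdax == 1.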

changes since v2:
- modify fuse_show_options() accordingly to make it compatible with
  new tri-state mount option (patch 2)
- extract FUSE protocol changes into one separate patch (patch 3)
- FUSE server/client need to negotiate if they support per-file DAX
  (patch 4)
- extract DONT_CACHE logic into patch 6/7

v3: https://www.spinics.net/lists/linux-fsdevel/msg200852.html
v2: https://www.spinics.net/lists/linux-fsdevel/msg199584.html
v1: https://www.spinics.net/lists/linux-virtualization/msg51008.html

Jeffle Xu (8):
  fuse: add fuse_should_enable_dax() helper
  fuse: Make DAX mount option a tri-state
  fuse: support per-file DAX
  fuse: negotiate if server/client supports per-file DAX
  fuse: enable per-file DAX
  fuse: mark inode DONT_CACHE when per-file DAX indication changes
  fuse: support changing per-file DAX flag inside guest
  fuse: show '-o dax=inode' option only when FUSE server supports

 fs/fuse/dax.c             | 32 +++++++++++++++++++++++++++++---
 fs/fuse/file.c            |  4 ++--
 fs/fuse/fuse_i.h          | 22 ++++++++++++++++++----
 fs/fuse/inode.c           | 27 +++++++++++++++++++--------
 fs/fuse/ioctl.c           | 15 +++++++++++++--
 fs/fuse/virtio_fs.c       | 16 ++++++++++++++--
 include/uapi/linux/fuse.h |  9 ++++++++-
 7 files changed, 103 insertions(+), 22 deletions(-)

Comments

Miklos Szeredi Aug. 17, 2021, 8:06 a.m. UTC | #1
On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
>
> This patchset adds support of per-file DAX for virtiofs, which is
> inspired by Ira Weiny's work on ext4[1] and xfs[2].

Can you please explain the background of this change in detail?

Why would an admin want to enable DAX for a particular virtiofs file
and not for others?

Thanks,
Miklos
Dr. David Alan Gilbert Aug. 17, 2021, 9:32 a.m. UTC | #2
* Miklos Szeredi (miklos@szeredi.hu) wrote:
> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> >
> > This patchset adds support of per-file DAX for virtiofs, which is
> > inspired by Ira Weiny's work on ext4[1] and xfs[2].
> 
> Can you please explain the background of this change in detail?
> 
> Why would an admin want to enable DAX for a particular virtiofs file
> and not for others?

Where we're contending on virtiofs dax cache size it makes a lot of
sense; it's quite expensive for us to map something into the cache
(especially if we push something else out), so selectively DAXing files
that are expected to be hot could help reduce cache churn.
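
The cache churn argument can be sketched with a toy model in Python. It
assumes an LRU replacement policy and one chunk per file purely for
illustration; the thread does not describe the real reclaim policy, and the
file names and window size below are made up.

```python
from collections import OrderedDict


def count_mappings(accesses, window_chunks, dax_files=None):
    """Count chunk mappings set up for a sequence of file accesses.

    accesses      -- iterable of file names (one chunk per file, for simplicity)
    window_chunks -- number of chunks that fit in the DAX window
    dax_files     -- set of files allowed to use DAX; None means every file
    """
    window = OrderedDict()  # mapped files, least recently used first
    mappings = 0
    for name in accesses:
        if dax_files is not None and name not in dax_files:
            continue  # served by regular FUSE requests, no mapping needed
        if name in window:
            window.move_to_end(name)  # hit: reuse the existing mapping
            continue
        if len(window) >= window_chunks:
            window.popitem(last=False)  # evict the least recently used chunk
        window[name] = True
        mappings += 1
    return mappings


# Two hot files interleaved with 100 files that are each read only once.
accesses = []
for i in range(100):
    accesses += ["libA.so", "libB.so", f"cold{i}"]

print(count_mappings(accesses, window_chunks=2))  # 300
print(count_mappings(accesses, window_chunks=2,
                     dax_files={"libA.so", "libB.so"}))  # 2
```

In this model, DAXing everything makes every access a fresh mapping because
the cold files keep pushing the hot ones out, while restricting DAX to the
two hot files needs only two mappings in total.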

Dave

> Thanks,
> Miklos
> 
Miklos Szeredi Aug. 17, 2021, 10:09 a.m. UTC | #3
On Tue, 17 Aug 2021 at 11:32, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Miklos Szeredi (miklos@szeredi.hu) wrote:
> > On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> > >
> > > This patchset adds support of per-file DAX for virtiofs, which is
> > > inspired by Ira Weiny's work on ext4[1] and xfs[2].
> >
> > Can you please explain the background of this change in detail?
> >
> > Why would an admin want to enable DAX for a particular virtiofs file
> > and not for others?
>
> Where we're contending on virtiofs dax cache size it makes a lot of
> sense; it's quite expensive for us to map something into the cache
> (especially if we push something else out), so selectively DAXing files
> that are expected to be hot could help reduce cache churn.

If this is a performance issue, it should be fixed in a way that
doesn't require hand tuning like you suggest, I think.

I'm not sure what the  ext4/xfs case for per-file DAX is.  Maybe that
can help understand the virtiofs case as well.

Thanks,
Miklos
Dr. David Alan Gilbert Aug. 17, 2021, 10:37 a.m. UTC | #4
* Miklos Szeredi (miklos@szeredi.hu) wrote:
> On Tue, 17 Aug 2021 at 11:32, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > * Miklos Szeredi (miklos@szeredi.hu) wrote:
> > > On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> > > >
> > > > This patchset adds support of per-file DAX for virtiofs, which is
> > > > inspired by Ira Weiny's work on ext4[1] and xfs[2].
> > >
> > > Can you please explain the background of this change in detail?
> > >
> > > Why would an admin want to enable DAX for a particular virtiofs file
> > > and not for others?
> >
> > Where we're contending on virtiofs dax cache size it makes a lot of
> > sense; it's quite expensive for us to map something into the cache
> > (especially if we push something else out), so selectively DAXing files
> > that are expected to be hot could help reduce cache churn.
> 
> If this is a performance issue, it should be fixed in a way that
> doesn't require hand tuning like you suggest, I think.

I'd agree that would be nice; however:
  a) It looks like other filesystems already have something admin
selectable
  b) Trying to write clever heuristics is only going to work in some
cases; being able to say 'DAX this directory' might work better in
practice.

> I'm not sure what the  ext4/xfs case for per-file DAX is.  Maybe that
> can help understand the virtiofs case as well.

Yep, I don't understand the case with real nvdimm hardware.

Dave

> Thanks,
> Miklos
>
Vivek Goyal Aug. 17, 2021, 12:39 p.m. UTC | #5
On Tue, Aug 17, 2021 at 10:06:53AM +0200, Miklos Szeredi wrote:
> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> >
> > This patchset adds support of per-file DAX for virtiofs, which is
> > inspired by Ira Weiny's work on ext4[1] and xfs[2].
> 
> Can you please explain the background of this change in detail?
> 
> Why would an admin want to enable DAX for a particular virtiofs file
> and not for others?

Initially I thought that they needed it because they are downloading
files on the fly from the server, so they don't want to enable dax on a
file until it is completely downloaded. But later I realized that they
should be able to block in the FUSE_SETUPMAPPING call, make sure the
associated file section has been downloaded before returning, and so solve
the problem. So that can't be the primary reason.

The other reason mentioned, I think, was that only certain files benefit
from DAX. But there are not many details after that. It would be nice
to hear a more concrete use case and more details about this usage.

Thanks
Vivek
Vivek Goyal Aug. 17, 2021, 12:40 p.m. UTC | #6
On Tue, Aug 17, 2021 at 10:32:14AM +0100, Dr. David Alan Gilbert wrote:
> * Miklos Szeredi (miklos@szeredi.hu) wrote:
> > On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> > >
> > > This patchset adds support of per-file DAX for virtiofs, which is
> > > inspired by Ira Weiny's work on ext4[1] and xfs[2].
> > 
> > Can you please explain the background of this change in detail?
> > 
> > Why would an admin want to enable DAX for a particular virtiofs file
> > and not for others?
> 
> Where we're contending on virtiofs dax cache size it makes a lot of
> sense; it's quite expensive for us to map something into the cache
> (especially if we push something else out), so selectively DAXing files
> that are expected to be hot could help reduce cache churn.

In that case we should probably just make the DAX window larger. I assume
that selecting which files to turn DAX on for will itself not be
trivial. I am not sure what heuristics are being deployed to determine
that, and would like to know more about it.

Vivek
Jingbo Xu Aug. 17, 2021, 1:08 p.m. UTC | #7
On 8/17/21 6:09 PM, Miklos Szeredi wrote:
> On Tue, 17 Aug 2021 at 11:32, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
>>
>> * Miklos Szeredi (miklos@szeredi.hu) wrote:
>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
>>>>
>>>> This patchset adds support of per-file DAX for virtiofs, which is
>>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>
>>> Can you please explain the background of this change in detail?
>>>
>>> Why would an admin want to enable DAX for a particular virtiofs file
>>> and not for others?
>>
>> Where we're contending on virtiofs dax cache size it makes a lot of
>> sense; it's quite expensive for us to map something into the cache
>> (especially if we push something else out), so selectively DAXing files
>> that are expected to be hot could help reduce cache churn.
> 
> If this is a performance issue, it should be fixed in a way that
> doesn't require hand tuning like you suggest, I think.
> 
> I'm not sure what the  ext4/xfs case for per-file DAX is.  Maybe that
> can help understand the virtiofs case as well.
> 

Some hints as to why ext4/xfs support per-file DAX can be found in [1] and [2].

"Boaz Harrosh wondered why someone might want to turn DAX off for a
persistent memory device. Hellwig said that the performance "could
suck"; Williams noted that the page cache could be useful for some
applications as well. Jan Kara pointed out that reads from persistent
memory are close to DRAM speed, but that writes are not; the page cache
could be helpful for frequent writes. Applications need to change to
fully take advantage of DAX, Williams said; part of the promise of
adding a flag is that users can do DAX on smaller granularities than a
full filesystem."

In summary, the page cache is preferable in some cases, and thus a more
fine-grained way of controlling DAX is needed.


As for virtiofs, Dr. David Alan Gilbert has mentioned that various files
may compete for the limited DAX window resource.

Besides, supporting DAX for small files can be expensive. Small files
can consume the DAX window rapidly, and if small files are accessed
only once, the cost of mmap/munmap on the host cannot be ignored.
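
As a rough illustration of how quickly small files use up the window, here
is a minimal Python sketch. It assumes every mapped file occupies whole
chunks of 2MB (the mapping granularity mentioned elsewhere in this thread);
the file counts and sizes are made-up example values.

```python
CHUNK_SIZE = 2 * 1024 * 1024  # assumed 2MB mapping granularity


def window_bytes_used(file_sizes, chunk_size=CHUNK_SIZE):
    """DAX window bytes occupied when each file is mapped in whole chunks."""
    # -(-a // b) is integer ceiling division.
    chunks = sum(-(-size // chunk_size) for size in file_sizes if size > 0)
    return chunks * chunk_size


small_files = [4096] * 512  # 512 files of 4KB each, 2MB of data in total
used = window_bytes_used(small_files)
print(used // (1024 * 1024))  # 1024 (MB of window for 2MB of file data)
```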


[1]
https://lore.kernel.org/lkml/20200428002142.404144-1-ira.weiny@intel.com/
[2] https://lwn.net/Articles/787973/
Jingbo Xu Aug. 17, 2021, 1:22 p.m. UTC | #8
On 8/17/21 8:39 PM, Vivek Goyal wrote:
> On Tue, Aug 17, 2021 at 10:06:53AM +0200, Miklos Szeredi wrote:
>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
>>>
>>> This patchset adds support of per-file DAX for virtiofs, which is
>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>
>> Can you please explain the background of this change in detail?
>>
>> Why would an admin want to enable DAX for a particular virtiofs file
>> and not for others?
> 
> Initially I thought that they needed it because they are downloading
> files on the fly from server. So they don't want to enable dax on the file
> till file is completely downloaded. 

Right, it's our initial requirement.


> But later I realized that they should
> be able to block in FUSE_SETUPMAPPING call and make sure associated
> file section has been downloaded before returning and solve the problem.
> So that can't be the primary reason.

Say we want to access 4KB of one file inside the guest. If it goes
through the FUSE request routine, the fuse daemon only needs to download
this 4KB from the remote server. But if it goes through DAX, the fuse
daemon needs to download the whole DAX window (e.g., 2MB) from the remote
server: so-called amplification. Maybe we could decrease the DAX window
size, but it's a trade-off.
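
The amplification can be put into numbers with a small Python sketch. It
assumes the daemon fetches whole 2MB chunks for DAX-mapped ranges and exactly
the requested bytes for plain FUSE reads, as described above; the function
name and parameters are illustrative only.

```python
CHUNK_SIZE = 2 * 1024 * 1024  # example DAX mapping size quoted above


def bytes_downloaded(offset, length, dax, chunk_size=CHUNK_SIZE):
    """Bytes the daemon fetches from the remote server for one guest read."""
    if not dax:
        return length  # plain FUSE READ: fetch only what was asked for
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return (last - first + 1) * chunk_size  # whole chunks backing the range


fuse_bytes = bytes_downloaded(0, 4096, dax=False)
dax_bytes = bytes_downloaded(0, 4096, dax=True)
print(dax_bytes // fuse_bytes)  # 512
```

A smaller chunk size lowers the factor proportionally, which is the
trade-off mentioned above.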

> 
> Other reason mentioned I think was that only certain files benefit
> from DAX. But not much details are there after that. It will be nice
> to hear a more concrete use case and more details about this usage.
> 

Apart from our internal requirement, more fine-grained control over DAX
would be more general and flexible. Glad to hear more discussion from
the community.
Miklos Szeredi Aug. 17, 2021, 2:08 p.m. UTC | #9
On Tue, 17 Aug 2021 at 15:22, JeffleXu <jefflexu@linux.alibaba.com> wrote:
>
>
>
> On 8/17/21 8:39 PM, Vivek Goyal wrote:
> > On Tue, Aug 17, 2021 at 10:06:53AM +0200, Miklos Szeredi wrote:
> >> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> >>>
> >>> This patchset adds support of per-file DAX for virtiofs, which is
> >>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
> >>
> >> Can you please explain the background of this change in detail?
> >>
> >> Why would an admin want to enable DAX for a particular virtiofs file
> >> and not for others?
> >
> > Initially I thought that they needed it because they are downloading
> > files on the fly from server. So they don't want to enable dax on the file
> > till file is completely downloaded.
>
> Right, it's our initial requirement.
>
>
> > But later I realized that they should
> > be able to block in FUSE_SETUPMAPPING call and make sure associated
> > file section has been downloaded before returning and solve the problem.
> > So that can't be the primary reason.
>
> Saying we want to access 4KB of one file inside guest, if it goes
> through FUSE request routine, then the fuse daemon only need to download
> this 4KB from remote server. But if it goes through DAX, then the fuse
> daemon need to download the whole DAX window (e.g., 2MB) from remote
> server, so called amplification. Maybe we could decrease the DAX window
> size, but it's a trade off.

That could be achieved with a plain fuse filesystem on the host (which
will get 4k READ requests for accesses to mapped area inside guest).
Since this can be done selectively for files which are not yet
downloaded, the extra layer wouldn't be a performance problem.

Is there a reason why that wouldn't work?

Thanks,
Miklos
Miklos Szeredi Aug. 17, 2021, 2:11 p.m. UTC | #10
On Tue, 17 Aug 2021 at 15:08, JeffleXu <jefflexu@linux.alibaba.com> wrote:
>
>
>
> On 8/17/21 6:09 PM, Miklos Szeredi wrote:
> > On Tue, 17 Aug 2021 at 11:32, Dr. David Alan Gilbert
> > <dgilbert@redhat.com> wrote:
> >>
> >> * Miklos Szeredi (miklos@szeredi.hu) wrote:
> >>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> >>>>
> >>>> This patchset adds support of per-file DAX for virtiofs, which is
> >>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
> >>>
> >>> Can you please explain the background of this change in detail?
> >>>
> >>> Why would an admin want to enable DAX for a particular virtiofs file
> >>> and not for others?
> >>
> >> Where we're contending on virtiofs dax cache size it makes a lot of
> >> sense; it's quite expensive for us to map something into the cache
> >> (especially if we push something else out), so selectively DAXing files
> >> that are expected to be hot could help reduce cache churn.
> >
> > If this is a performance issue, it should be fixed in a way that
> > doesn't require hand tuning like you suggest, I think.
> >
> > I'm not sure what the  ext4/xfs case for per-file DAX is.  Maybe that
> > can help understand the virtiofs case as well.
> >
>
> Some hints why ext4/xfs support per-file DAX can be found [1] and [2].
>
> "Boaz Harrosh wondered why someone might want to turn DAX off for a
> persistent memory device. Hellwig said that the performance "could
> suck"; Williams noted that the page cache could be useful for some
> applications as well. Jan Kara pointed out that reads from persistent
> memory are close to DRAM speed, but that writes are not; the page cache
> could be helpful for frequent writes. Applications need to change to
> fully take advantage of DAX, Williams said; part of the promise of
> adding a flag is that users can do DAX on smaller granularities than a
> full filesystem."
>
> In summary, page cache is preferable in some cases, and thus more fine
> grained way of DAX control is needed.

Hmm, okay, very frequent overwrites could be problematic for directly
mapped nvram.

>
> As for virtiofs, Dr. David Alan Gilbert has mentioned that various files
> may compete for limited DAX window resource.
>
> Besides, supporting DAX for small files can be expensive. Small files
> can consume DAX window resource rapidly, and if small files are accessed
> only once, the cost of mmap/munmap on host can not be ignored.

That's a good point.   Maybe we should disable DAX for file sizes much
smaller than the chunk size?

Thanks,
Miklos
Vivek Goyal Aug. 17, 2021, 2:54 p.m. UTC | #11
On Tue, Aug 17, 2021 at 09:08:35PM +0800, JeffleXu wrote:
> 
> 
> On 8/17/21 6:09 PM, Miklos Szeredi wrote:
> > On Tue, 17 Aug 2021 at 11:32, Dr. David Alan Gilbert
> > <dgilbert@redhat.com> wrote:
> >>
> >> * Miklos Szeredi (miklos@szeredi.hu) wrote:
> >>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> >>>>
> >>>> This patchset adds support of per-file DAX for virtiofs, which is
> >>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
> >>>
> >>> Can you please explain the background of this change in detail?
> >>>
> >>> Why would an admin want to enable DAX for a particular virtiofs file
> >>> and not for others?
> >>
> >> Where we're contending on virtiofs dax cache size it makes a lot of
> >> sense; it's quite expensive for us to map something into the cache
> >> (especially if we push something else out), so selectively DAXing files
> >> that are expected to be hot could help reduce cache churn.
> > 
> > If this is a performance issue, it should be fixed in a way that
> > doesn't require hand tuning like you suggest, I think.
> > 
> > I'm not sure what the  ext4/xfs case for per-file DAX is.  Maybe that
> > can help understand the virtiofs case as well.
> > 
> 
> Some hints why ext4/xfs support per-file DAX can be found [1] and [2].
> 
> "Boaz Harrosh wondered why someone might want to turn DAX off for a
> persistent memory device. Hellwig said that the performance "could
> suck"; Williams noted that the page cache could be useful for some
> applications as well. Jan Kara pointed out that reads from persistent
> memory are close to DRAM speed, but that writes are not; the page cache
> could be helpful for frequent writes. Applications need to change to
> fully take advantage of DAX, Williams said; part of the promise of
> adding a flag is that users can do DAX on smaller granularities than a
> full filesystem."
> 
> In summary, page cache is preferable in some cases, and thus more fine
> grained way of DAX control is needed.

In case of virtiofs, we are using page cache on host. So this probably
is not a factor for us. Writes will go in page cache of host.

> 
> 
> As for virtiofs, Dr. David Alan Gilbert has mentioned that various files
> may compete for limited DAX window resource.
> 
> Besides, supporting DAX for small files can be expensive. Small files
> can consume DAX window resource rapidly, and if small files are accessed
> only once, the cost of mmap/munmap on host can not be ignored.

W.r.t. access pattern, the same applies to large files as well. If a section
of a large file is accessed only once, it will consume dax window as well
and will have to be reclaimed.

Dax in virtiofs provides a speed gain only if we map a file once and access
it multiple times. If that pattern does not hold, then dax does
not seem to provide speed gains and in fact might be slower than
non-dax.
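
One way to state this more precisely is a simple break-even model, sketched
here in Python. The cost values are placeholders in arbitrary units, not
measurements; the model only assumes a one-time mapping cost plus a
per-access cost for dax, and a per-access cost for non-dax.

```python
def break_even_accesses(map_cost, dax_access_cost, fuse_access_cost):
    """Smallest access count at which dax is strictly cheaper than non-dax.

    dax total:     map_cost + n * dax_access_cost
    non-dax total: n * fuse_access_cost
    Returns None if a dax access is not cheaper than a non-dax access.
    Costs are integers in arbitrary units.
    """
    saving = fuse_access_cost - dax_access_cost
    if saving <= 0:
        return None
    return map_cost // saving + 1


# Placeholder costs: mapping costs 100 units, a dax access 1, a FUSE access 11.
print(break_even_accesses(100, 1, 11))  # 11
```

With these placeholder numbers a file accessed ten times or fewer gains
nothing from dax, which matches the "map once, access many times" condition.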

So if there is a pattern where we know some files are accessed repeatedly
while others are not, then enabling/disabling dax selectively will make
sense. The question is how many workloads really know that, and how you
will make that decision. Do you have any data to back that up?

W.r.t. small files, is that a real concern? If the file is being accessed
multiple times, then we will still see the speed gain. The only downside
is that there is a little wastage of resources because our minimum dax
mapping granularity is 2MB. I am wondering whether we can handle that by
supporting other dax mapping granularities as well, say 256K, and letting
users choose.

Thanks
Vivek
> 
> 
> [1]
> https://lore.kernel.org/lkml/20200428002142.404144-1-ira.weiny@intel.com/
> [2] https://lwn.net/Articles/787973/
> 
> -- 
> Thanks,
> Jeffle
>
Vivek Goyal Aug. 17, 2021, 2:57 p.m. UTC | #12
On Tue, Aug 17, 2021 at 09:22:53PM +0800, JeffleXu wrote:
> 
> 
> On 8/17/21 8:39 PM, Vivek Goyal wrote:
> > On Tue, Aug 17, 2021 at 10:06:53AM +0200, Miklos Szeredi wrote:
> >> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> >>>
> >>> This patchset adds support of per-file DAX for virtiofs, which is
> >>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
> >>
> >> Can you please explain the background of this change in detail?
> >>
> >> Why would an admin want to enable DAX for a particular virtiofs file
> >> and not for others?
> > 
> > Initially I thought that they needed it because they are downloading
> > files on the fly from server. So they don't want to enable dax on the file
> > till file is completely downloaded. 
> 
> Right, it's our initial requirement.
> 
> 
> > But later I realized that they should
> > be able to block in FUSE_SETUPMAPPING call and make sure associated
> > file section has been downloaded before returning and solve the problem.
> > So that can't be the primary reason.
> 
> Saying we want to access 4KB of one file inside guest, if it goes
> through FUSE request routine, then the fuse daemon only need to download
> this 4KB from remote server. But if it goes through DAX, then the fuse
> daemon need to download the whole DAX window (e.g., 2MB) from remote
> server, so called amplification. Maybe we could decrease the DAX window
> size, but it's a trade off.

Downloading a 2MB chunk should not be a big issue (IMHO). And if this
turns out to be a real concern, we could experiment with a smaller
mapping granularity.

> 
> > 
> > Other reason mentioned I think was that only certain files benefit
> > from DAX. But not much details are there after that. It will be nice
> > to hear a more concrete use case and more details about this usage.
> > 
> 
> Apart from our internal requirement, more fine grained control for DAX
> shall be general and more flexible. Glad to hear more discussion from
> community.

Sure it will be more general and flexible. But there needs to be 1-2
good concrete use cases to justify additional complexity. And I don't
think that so far a good use case has come forward.

Thanks
Vivek
Vivek Goyal Aug. 17, 2021, 3:19 p.m. UTC | #13
On Tue, Aug 17, 2021 at 04:11:14PM +0200, Miklos Szeredi wrote:

[..]
> > As for virtiofs, Dr. David Alan Gilbert has mentioned that various files
> > may compete for limited DAX window resource.
> >
> > Besides, supporting DAX for small files can be expensive. Small files
> > can consume DAX window resource rapidly, and if small files are accessed
> > only once, the cost of mmap/munmap on host can not be ignored.
> 
> That's a good point.   Maybe we should disable DAX for file sizes much
> smaller than the chunk size?

This indeed seems like a valid concern. 2MB chunk size will consume
512 struct page entries. If an entry is 64 bytes in size, then that's
32K RAM used to access 4K bytes of file. Does not sound like good usage
of resources.
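
The arithmetic above, written out as a short Python check. The 4KB page size
and the 64-byte entry size are the values assumed in this message.

```python
PAGE_SIZE = 4096          # bytes per page, as assumed above
STRUCT_PAGE_SIZE = 64     # bytes per struct page entry, as assumed above
CHUNK_SIZE = 2 * 1024 * 1024


def struct_page_overhead(chunk_size=CHUNK_SIZE, page_size=PAGE_SIZE,
                         entry_size=STRUCT_PAGE_SIZE):
    """RAM used by struct page entries to back one mapped chunk."""
    entries = chunk_size // page_size
    return entries, entries * entry_size


entries, overhead = struct_page_overhead()
print(entries)           # 512
print(overhead // 1024)  # 32 (KB of RAM, even if only 4KB of file is read)
```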

If we end up selectively disabling dax based on file size, two things
come to my mind.

- It will be good if users can opt in to this behavior. There
  might be a class of users who always want to enable dax on all
  files.

- Secondly, we will have to figure out how to do it safely in the
  case of a shared filesystem, where file size can change suddenly.
  We will need to make sure the change from dax to no-dax and vice versa
  is safe w.r.t. page cache and other paths.

Thanks
Vivek
Jingbo Xu Aug. 18, 2021, 3:39 a.m. UTC | #14
On 8/17/21 10:08 PM, Miklos Szeredi wrote:
> On Tue, 17 Aug 2021 at 15:22, JeffleXu <jefflexu@linux.alibaba.com> wrote:
>>
>>
>>
>> On 8/17/21 8:39 PM, Vivek Goyal wrote:
>>> On Tue, Aug 17, 2021 at 10:06:53AM +0200, Miklos Szeredi wrote:
>>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>
>>>>> This patchset adds support of per-file DAX for virtiofs, which is
>>>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>>
>>>> Can you please explain the background of this change in detail?
>>>>
>>>> Why would an admin want to enable DAX for a particular virtiofs file
>>>> and not for others?
>>>
>>> Initially I thought that they needed it because they are downloading
>>> files on the fly from server. So they don't want to enable dax on the file
>>> till file is completely downloaded.
>>
>> Right, it's our initial requirement.
>>
>>
>>> But later I realized that they should
>>> be able to block in FUSE_SETUPMAPPING call and make sure associated
>>> file section has been downloaded before returning and solve the problem.
>>> So that can't be the primary reason.
>>
>> Saying we want to access 4KB of one file inside guest, if it goes
>> through FUSE request routine, then the fuse daemon only need to download
>> this 4KB from remote server. But if it goes through DAX, then the fuse
>> daemon need to download the whole DAX window (e.g., 2MB) from remote
>> server, so called amplification. Maybe we could decrease the DAX window
>> size, but it's a trade off.
> 
> That could be achieved with a plain fuse filesystem on the host (which
> will get 4k READ requests for accesses to mapped area inside guest).
> Since this can be done selectively for files which are not yet
> downloaded, the extra layer wouldn't be a performance problem.

I'm not sure I fully understand your idea. In this case, the host
daemon only prepares 4KB while the guest thinks that the whole DAX window
(e.g., 2MB) has been fully mapped. Then when the guest really accesses the
remaining part (2MB - 4KB), a page fault is triggered, and the host daemon
is now responsible for downloading the remaining part?

> 
> Is there a reason why that wouldn't work?
> 
> Thanks,
> Miklos
>
Miklos Szeredi Aug. 18, 2021, 5:08 a.m. UTC | #15
On Wed, 18 Aug 2021 at 05:40, JeffleXu <jefflexu@linux.alibaba.com> wrote:

> I'm not sure if I fully understand your idea. Then in this case, host
> daemon only prepares 4KB while guest thinks that the whole DAX window
> (e.g., 2MB) has been fully mapped. Then when guest really accesses the
> remained part (2MB - 4KB), page fault is triggered, and now host daemon
> is responsible for downloading the remained part?

Yes.  Mapping an area just means setting up the page tables, it does
not result in actual data transfer.

Thanks,
Miklos
Jingbo Xu Aug. 18, 2021, 5:10 a.m. UTC | #16
On 8/17/21 10:54 PM, Vivek Goyal wrote:
> On Tue, Aug 17, 2021 at 09:08:35PM +0800, JeffleXu wrote:
>>
>>
>> On 8/17/21 6:09 PM, Miklos Szeredi wrote:
>>> On Tue, 17 Aug 2021 at 11:32, Dr. David Alan Gilbert
>>> <dgilbert@redhat.com> wrote:
>>>>
>>>> * Miklos Szeredi (miklos@szeredi.hu) wrote:
>>>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>
>>>>>> This patchset adds support of per-file DAX for virtiofs, which is
>>>>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>>>
>>>>> Can you please explain the background of this change in detail?
>>>>>
>>>>> Why would an admin want to enable DAX for a particular virtiofs file
>>>>> and not for others?
>>>>
>>>> Where we're contending on virtiofs dax cache size it makes a lot of
>>>> sense; it's quite expensive for us to map something into the cache
>>>> (especially if we push something else out), so selectively DAXing files
>>>> that are expected to be hot could help reduce cache churn.
>>>
>>> If this is a performance issue, it should be fixed in a way that
>>> doesn't require hand tuning like you suggest, I think.
>>>
>>> I'm not sure what the  ext4/xfs case for per-file DAX is.  Maybe that
>>> can help understand the virtiofs case as well.
>>>
>>
>> Some hints why ext4/xfs support per-file DAX can be found [1] and [2].
>>
>> "Boaz Harrosh wondered why someone might want to turn DAX off for a
>> persistent memory device. Hellwig said that the performance "could
>> suck"; Williams noted that the page cache could be useful for some
>> applications as well. Jan Kara pointed out that reads from persistent
>> memory are close to DRAM speed, but that writes are not; the page cache
>> could be helpful for frequent writes. Applications need to change to
>> fully take advantage of DAX, Williams said; part of the promise of
>> adding a flag is that users can do DAX on smaller granularities than a
>> full filesystem."
>>
>> In summary, page cache is preferable in some cases, and thus more fine
>> grained way of DAX control is needed.
> 
> In case of virtiofs, we are using page cache on host. So this probably
> is not a factor for us. Writes will go in page cache of host.
> 
>>
>>
>> As for virtiofs, Dr. David Alan Gilbert has mentioned that various files
>> may compete for limited DAX window resource.
>>
>> Besides, supporting DAX for small files can be expensive. Small files
>> can consume DAX window resource rapidly, and if small files are accessed
>> only once, the cost of mmap/munmap on host can not be ignored.
> 
> W.r.r access pattern, same applies to large files also. So if a section
> of large file is accessed only once, it will consume dax window as well
> and will have to be reclaimed.
> 
> Dax in virtiofs provides speed gain only if map file once and access
> it multiple times. If that pattern does not hold true, then dax does
> not seem to provide speed gains and in fact might be slower than
> non-dax.
> 
> So if there is a pattern where we know some files are accessed repeatedly
> while others are not, then enabling/disabling dax selectively will make
> sense. Question is how many workloads really know that and how will
> you make that decision. Do you have any data to back that up.

There's no precise performance data yet. Empirically, small files tend
to perform worse with dax, while frequently accessed files
(such as .so libraries) behave better with dax.

> 
> W.r.t small file, is that a real concern. If that file is being accessed
> mutliple times, then we will still see the speed gain. Only down side
> is that there is little wastage of resources because our minimum dax
> mapping granularity is 2MB. I am wondering can we handle that by
> supporting other dax mapping granularities as well. say 256K and let
> users choose it.
Jingbo Xu Aug. 18, 2021, 5:20 a.m. UTC | #17
On 8/17/21 10:57 PM, Vivek Goyal wrote:
> On Tue, Aug 17, 2021 at 09:22:53PM +0800, JeffleXu wrote:
>>
>>
>> On 8/17/21 8:39 PM, Vivek Goyal wrote:
>>> On Tue, Aug 17, 2021 at 10:06:53AM +0200, Miklos Szeredi wrote:
>>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>
>>>>> This patchset adds support of per-file DAX for virtiofs, which is
>>>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>>
>>>> Can you please explain the background of this change in detail?
>>>>
>>>> Why would an admin want to enable DAX for a particular virtiofs file
>>>> and not for others?
>>>
>>> Initially I thought that they needed it because they are downloading
>>> files on the fly from server. So they don't want to enable dax on the file
>>> till file is completely downloaded. 
>>
>> Right, it's our initial requirement.
>>
>>
>>> But later I realized that they should
>>> be able to block in FUSE_SETUPMAPPING call and make sure associated
>>> file section has been downloaded before returning and solve the problem.
>>> So that can't be the primary reason.
>>
>> Saying we want to access 4KB of one file inside guest, if it goes
>> through FUSE request routine, then the fuse daemon only need to download
>> this 4KB from remote server. But if it goes through DAX, then the fuse
>> daemon need to download the whole DAX window (e.g., 2MB) from remote
>> server, so called amplification. Maybe we could decrease the DAX window
>> size, but it's a trade off.
> 
> Downloading 2MB chunk should not be a big issue (IMHO). 

Then the latency increases. Latency really matters in our use case.
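
To put a number on the amplification described in the quoted text, here is a minimal sketch of the arithmetic. It assumes the 4KB request and the 2MB DAX chunk mentioned above, and that the daemon must have the whole chunk present before the mapping can be used; the function name is only illustrative.

```python
# Illustrative arithmetic only; sizes are the ones quoted in the thread.
REQUEST_BYTES = 4 * 1024            # data the guest actually wants
DAX_CHUNK_BYTES = 2 * 1024 * 1024   # DAX mapping granularity

def amplification(request_bytes, chunk_bytes):
    """Bytes downloaded per byte requested when whole chunks must be present."""
    chunks = -(-request_bytes // chunk_bytes)  # ceiling division
    return chunks * chunk_bytes / request_bytes

print(amplification(REQUEST_BYTES, DAX_CHUNK_BYTES))  # 512.0
```

So a single 4KB access can mean a 512x larger download, which is where the extra latency comes from.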


> And if this
> turns out to be real concern, we could experiment with a smaller
> mapping granularity.
>
Vivek Goyal Aug. 18, 2021, 4:58 p.m. UTC | #18
On Wed, Aug 18, 2021 at 07:08:24AM +0200, Miklos Szeredi wrote:
> On Wed, 18 Aug 2021 at 05:40, JeffleXu <jefflexu@linux.alibaba.com> wrote:
> 
> > I'm not sure if I fully understand your idea. Then in this case, host
> > daemon only prepares 4KB while guest thinks that the whole DAX window
> > (e.g., 2MB) has been fully mapped. Then when guest really accesses the
> > remained part (2MB - 4KB), page fault is triggered, and now host daemon
> > is responsible for downloading the remained part?
> 
> Yes.  Mapping an area just means setting up the page tables, it does
> not result in actual data transfer.

But the daemon will not get the page fault (it's the host kernel which
will handle it). And the host kernel does not know that the file chunk
needs to be downloaded.

- Either we somehow figure out user fault handling and somehow
  qemu/virtiofsd get to handle the page fault then they can
  download file.

- Or we download the 2MB chunk at the FUSE_SETUPMAPPING time so
  that later kernel fault can handle it.

Am I missing something?

Vivek
Jingbo Xu Aug. 19, 2021, 6:14 a.m. UTC | #19
On 8/17/21 10:54 PM, Vivek Goyal wrote:
[...]
>>
>> As for virtiofs, Dr. David Alan Gilbert has mentioned that various files
>> may compete for limited DAX window resource.
>>
>> Besides, supporting DAX for small files can be expensive. Small files
>> can consume DAX window resource rapidly, and if small files are accessed
>> only once, the cost of mmap/munmap on host can not be ignored.
> 
> W.r.t. access pattern, same applies to large files also. So if a section
> of large file is accessed only once, it will consume dax window as well
> and will have to be reclaimed.
> 
> Dax in virtiofs provides speed gain only if we map a file once and access
> it multiple times. If that pattern does not hold true, then dax does
> not seem to provide speed gains and in fact might be slower than
> non-dax.
> 
> So if there is a pattern where we know some files are accessed repeatedly
> while others are not, then enabling/disabling dax selectively will make
> sense. Question is how many workloads really know that and how will
> you make that decision. Do you have any data to back that up.

Empirically, some files are naturally accessed only once, such as
configuration files under the /etc/ directory, .py and .js files, etc.
This is a case we have actually hit in production. Others are most
likely accessed multiple times, such as .so libraries. With the per-file
DAX feature, administrators can decide on their own which files shall be
dax enabled and thus gain the most benefit from dax, leaving it off for
the others.

As for how to distinguish file access patterns, besides the intuitive
insights described above, more advanced methods can be developed, e.g.,
scanning the DAX window map to find the hot files. The kernel offers the
mechanism; more advanced policies can be built on top of it later.

> 
> W.r.t small file, is that a real concern. If that file is being accessed
> multiple times, then we will still see the speed gain. Only down side
> is that there is little wastage of resources because our minimum dax
> mapping granularity is 2MB. I am wondering can we handle that by
> supporting other dax mapping granularities as well. say 256K and let
> users choose it.
>
Jingbo Xu Sept. 3, 2021, 5:30 a.m. UTC | #20
On 8/17/21 10:08 PM, Miklos Szeredi wrote:
> On Tue, 17 Aug 2021 at 15:22, JeffleXu <jefflexu@linux.alibaba.com> wrote:
>>
>>
>>
>> On 8/17/21 8:39 PM, Vivek Goyal wrote:
>>> On Tue, Aug 17, 2021 at 10:06:53AM +0200, Miklos Szeredi wrote:
>>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>
>>>>> This patchset adds support of per-file DAX for virtiofs, which is
>>>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>>
>>>> Can you please explain the background of this change in detail?
>>>>
>>>> Why would an admin want to enable DAX for a particular virtiofs file
>>>> and not for others?
>>>
>>> Initially I thought that they needed it because they are downloading
>>> files on the fly from server. So they don't want to enable dax on the file
>>> till file is completely downloaded.
>>
>> Right, it's our initial requirement.
>>
>>
>>> But later I realized that they should
>>> be able to block in FUSE_SETUPMAPPING call and make sure associated
>>> file section has been downloaded before returning and solve the problem.
>>> So that can't be the primary reason.
>>
>> Saying we want to access 4KB of one file inside guest, if it goes
>> through FUSE request routine, then the fuse daemon only need to download
>> this 4KB from remote server. But if it goes through DAX, then the fuse
>> daemon need to download the whole DAX window (e.g., 2MB) from remote
>> server, so called amplification. Maybe we could decrease the DAX window
>> size, but it's a trade off.
> 
> That could be achieved with a plain fuse filesystem on the host (which
> will get 4k READ requests for accesses to mapped area inside guest).
> Since this can be done selectively for files which are not yet
> downloaded, the extra layer wouldn't be a performance problem.
> 
> Is there a reason why that wouldn't work?

I wasn't aware of this mechanism (working around it from user space)
before sending this patch set.

After learning more about virtualization and KVM, I find that, as Vivek
Goyal replied in [1], virtiofsd/qemu would need to somehow hook the user
page fault and then download the remaining part.

IMHO, this mechanism (implementing a plain fuse filesystem on the host,
as you proposed) seems a little bit too complicated for now.


[1] https://lore.kernel.org/linux-fsdevel/YR08KnP8cO8LjKY7@redhat.com/
Miklos Szeredi Sept. 7, 2021, 2:51 p.m. UTC | #21
On Fri, 3 Sept 2021 at 07:31, JeffleXu <jefflexu@linux.alibaba.com> wrote:
>
>
>
> On 8/17/21 10:08 PM, Miklos Szeredi wrote:
> > On Tue, 17 Aug 2021 at 15:22, JeffleXu <jefflexu@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 8/17/21 8:39 PM, Vivek Goyal wrote:
> >>> On Tue, Aug 17, 2021 at 10:06:53AM +0200, Miklos Szeredi wrote:
> >>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> >>>>>
> >>>>> This patchset adds support of per-file DAX for virtiofs, which is
> >>>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
> >>>>
> >>>> Can you please explain the background of this change in detail?
> >>>>
> >>>> Why would an admin want to enable DAX for a particular virtiofs file
> >>>> and not for others?
> >>>
> >>> Initially I thought that they needed it because they are downloading
> >>> files on the fly from server. So they don't want to enable dax on the file
> >>> till file is completely downloaded.
> >>
> >> Right, it's our initial requirement.
> >>
> >>
> >>> But later I realized that they should
> >>> be able to block in FUSE_SETUPMAPPING call and make sure associated
> >>> file section has been downloaded before returning and solve the problem.
> >>> So that can't be the primary reason.
> >>
> >> Saying we want to access 4KB of one file inside guest, if it goes
> >> through FUSE request routine, then the fuse daemon only need to download
> >> this 4KB from remote server. But if it goes through DAX, then the fuse
> >> daemon need to download the whole DAX window (e.g., 2MB) from remote
> >> server, so called amplification. Maybe we could decrease the DAX window
> >> size, but it's a trade off.
> >
> > That could be achieved with a plain fuse filesystem on the host (which
> > will get 4k READ requests for accesses to mapped area inside guest).
> > Since this can be done selectively for files which are not yet
> > downloaded, the extra layer wouldn't be a performance problem.
> >
> > Is there a reason why that wouldn't work?
>
> I didn't realize this mechanism (working around from user space) before
> sending this patch set.
>
> After learning the virtualization and KVM stuffs, I find that, as Vivek
> Goyal replied in [1], virtiofsd/qemu need to somehow hook the user page
> fault and then download the remained part.
>
> IMHO, this mechanism (as you proposed by implementing a plain fuse
> filesystem on the host) seems a little bit sophisticated so far.


Agree.  Let's start with the simplest variant, which is the server
selectively enabling dax.

Thanks,
Miklos
Jingbo Xu Sept. 16, 2021, 8:21 a.m. UTC | #22
Hi, I've added some performance statistics below.


On 8/17/21 8:40 PM, Vivek Goyal wrote:
> On Tue, Aug 17, 2021 at 10:32:14AM +0100, Dr. David Alan Gilbert wrote:
>> * Miklos Szeredi (miklos@szeredi.hu) wrote:
>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
>>>>
>>>> This patchset adds support of per-file DAX for virtiofs, which is
>>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>
>>> Can you please explain the background of this change in detail?
>>>
>>> Why would an admin want to enable DAX for a particular virtiofs file
>>> and not for others?
>>
>> Where we're contending on virtiofs dax cache size it makes a lot of
>> sense; it's quite expensive for us to map something into the cache
>> (especially if we push something else out), so selectively DAXing files
>> that are expected to be hot could help reduce cache churn.

Yes, the performance of dax can suffer when the DAX window is small and
is contended by multiple files.

I tested a kernel build on virtiofs, emulating the scenario where a lot
of files contend for the dax window and trigger dax window reclaim.

Environment setup:
- guest vCPU: 16
- time make vmlinux -j128

type    | cache  | cache-size | time
------- | ------ | ---------- | ----
non-dax | always |   --       | real 2m48.119s
dax     | always | 64M        | real 4m49.563s
dax     | always |   1G       | real 3m14.200s
dax     | always |   4G       | real 2m41.141s


As can be seen, there is a performance drop compared to normal
buffered IO when the dax window is restricted and dax window reclaim is
triggered. The smaller the cache size, the worse the performance. The
drop shrinks and eventually disappears as the cache size increases.
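
For reference, the relative slowdown can be derived from the table with a few lines of Python. This is only a convenience sketch over the four `real` values reported above; the helper name `to_seconds` is made up for illustration.

```python
def to_seconds(real):
    """Convert a `time` string such as '2m48.119s' to seconds."""
    minutes, seconds = real.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)

# Values copied from the table above.
results = {
    "non-dax": "2m48.119s",
    "dax 64M": "4m49.563s",
    "dax 1G": "3m14.200s",
    "dax 4G": "2m41.141s",
}

baseline = to_seconds(results["non-dax"])
for name, real in results.items():
    print(f"{name:8s} {to_seconds(real) / baseline:.2f}x")
```

That gives roughly 1.72x the non-dax build time with a 64M window, 1.16x with 1G, and 0.96x with 4G.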

Though we may not compile kernels on virtiofs, we may well access a lot
of small files on virtiofs and suffer this performance drop.


> In that case probably we should just make DAX window larger. I assume

Yes, as the DAX window gets larger, we are less likely to run short of
dax window resource.

However, it doesn't come without cost. The 'struct page' descriptors for
the dax window consume guest memory at a ratio of ~1.5% (64/4096 = ~1.5%:
a page descriptor is 64 bytes, assuming 4K pages). That is, every 1GB of
cache size costs 16MB of guest memory. As the cache size increases, the
memory footprint of the page descriptors also increases, which may offset
the benefit dax brings by eliminating the guest page cache.
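
The overhead for the cache sizes used in the test above can be worked out as follows. This is a sketch of the arithmetic only, assuming the 64-byte descriptor and 4K page size stated in this mail; the function name is illustrative.

```python
PAGE_SIZE = 4096        # bytes, 4K pages as assumed above
STRUCT_PAGE_SIZE = 64   # bytes per page descriptor, as stated above
MiB = 1024 * 1024

def page_descriptor_overhead(window_bytes):
    """Guest memory consumed by page descriptors covering the dax window."""
    return (window_bytes // PAGE_SIZE) * STRUCT_PAGE_SIZE

for label, size in [("64M", 64 * MiB), ("1G", 1024 * MiB), ("4G", 4096 * MiB)]:
    print(label, page_descriptor_overhead(size) // MiB, "MiB")
```

This prints 1 MiB, 16 MiB and 64 MiB for the 64M, 1G and 4G windows respectively.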

In summary, the per-file dax feature tries to strike a balance between
performance and memory overhead by offering users finer-grained control
over dax.


> that selecting which files to turn DAX on will itself not be
> trivial. Not sure what heuristics are being deployed to determine
> that. Would like to know more about it.

Currently we enable dax for hot and large blob files, while disabling
dax for other miscellaneous small files.
Jingbo Xu Sept. 18, 2021, 3:06 a.m. UTC | #23
Hi Vivek, Miklos,

On 9/16/21 4:21 PM, JeffleXu wrote:
> Hi, I add some performance statistics below.
> 
> 
> On 8/17/21 8:40 PM, Vivek Goyal wrote:
>> On Tue, Aug 17, 2021 at 10:32:14AM +0100, Dr. David Alan Gilbert wrote:
>>> * Miklos Szeredi (miklos@szeredi.hu) wrote:
>>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>
>>>>> This patchset adds support of per-file DAX for virtiofs, which is
>>>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>>
>>>> Can you please explain the background of this change in detail?
>>>>
>>>> Why would an admin want to enable DAX for a particular virtiofs file
>>>> and not for others?
>>>
>>> Where we're contending on virtiofs dax cache size it makes a lot of
>>> sense; it's quite expensive for us to map something into the cache
>>> (especially if we push something else out), so selectively DAXing files
>>> that are expected to be hot could help reduce cache churn.
> 
> Yes, the performance of dax can be limited when the DAX window is
> limited, where dax window may be contended by multiple files.
> 
> I tested kernel compiling in virtiofs, emulating the scenario where a
> lot of files contending dax window and triggering dax window reclaiming.
> 
> Environment setup:
> - guest vCPU: 16
> - time make vmlinux -j128
> 
> type    | cache  | cache-size | time
> ------- | ------ | ---------- | ----
> non-dax | always |   --       | real 2m48.119s
> dax     | always | 64M        | real 4m49.563s
> dax     | always |   1G       | real 3m14.200s
> dax     | always |   4G       | real 2m41.141s
> 
> 
> It can be seen that there's performance drop, comparing to the normal
> buffered IO, when dax window resource is restricted and dax window
> reclaiming is triggered. The smaller the cache size is, the worse the
> performance is. The performance drop can be alleviated and eliminated as
> cache size increases.
> 
> Though we may not compile kernel in virtiofs, indeed we may access a lot
> of small files in virtiofs and suffer this performance drop.
> 
> 
>> In that case probably we should just make DAX window larger. I assume
> 
> Yes, as the DAX window gets larger, it is less likely that we can run
> short of dax window resource.
> 
> However it doesn't come without cost. 'struct page' descriptor for dax
> window will consume guest memory at a ratio of ~1.5% (64/4096 = ~1.5%,
> page descriptor is of 64 bytes size, assuming 4K sized page). That is,
> every 1GB cache size will cost 16MB guest memory. As the cache size
> increases, the memory footprint for page descriptors also increases,
> which may offset the benefit of dax by eliminating guest page cache.
> 
> In summary, per-file dax feature tries to achieve a balance between
> performance and memory overhead, by offering a finer-grained control for
> dax to users.
> 

I'm not sure whether this is adequate justification for introducing the
per-file dax feature upstream. I'd like some feedback from the community.

If it is, I also want to know whether setting/clearing S_DAX inside the
guest is needed, since in our internal use case, setting S_DAX from the
host daemon is adequate. If setting/clearing S_DAX inside the guest can
be omitted, then the negotiation during the FUSE_INIT phase is not needed
either. After all, we could rely entirely on the FUSE_ATTR_DAX flag fed
by the host daemon to decide whether dax shall be enabled for the
corresponding file. The whole patch set would also be somewhat simpler.
Vivek Goyal Sept. 19, 2021, 7:45 p.m. UTC | #24
On Thu, Sep 16, 2021 at 04:21:59PM +0800, JeffleXu wrote:
> Hi, I add some performance statistics below.
> 
> 
> On 8/17/21 8:40 PM, Vivek Goyal wrote:
> > On Tue, Aug 17, 2021 at 10:32:14AM +0100, Dr. David Alan Gilbert wrote:
> >> * Miklos Szeredi (miklos@szeredi.hu) wrote:
> >>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> >>>>
> >>>> This patchset adds support of per-file DAX for virtiofs, which is
> >>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
> >>>
> >>> Can you please explain the background of this change in detail?
> >>>
> >>> Why would an admin want to enable DAX for a particular virtiofs file
> >>> and not for others?
> >>
> >> Where we're contending on virtiofs dax cache size it makes a lot of
> >> sense; it's quite expensive for us to map something into the cache
> >> (especially if we push something else out), so selectively DAXing files
> >> that are expected to be hot could help reduce cache churn.
> 
> Yes, the performance of dax can be limited when the DAX window is
> limited, where dax window may be contended by multiple files.
> 
> I tested kernel compiling in virtiofs, emulating the scenario where a
> lot of files contending dax window and triggering dax window reclaiming.
> 
> Environment setup:
> - guest vCPU: 16
> - time make vmlinux -j128
> 
> type    | cache  | cache-size | time
> ------- | ------ | ---------- | ----
> non-dax | always |   --       | real 2m48.119s
> dax     | always | 64M        | real 4m49.563s
> dax     | always |   1G       | real 3m14.200s
> dax     | always |   4G       | real 2m41.141s
> 
> 
> It can be seen that there's performance drop, comparing to the normal
> buffered IO, when dax window resource is restricted and dax window
> reclaiming is triggered. The smaller the cache size is, the worse the
> performance is. The performance drop can be alleviated and eliminated as
> cache size increases.
> 
> Though we may not compile kernel in virtiofs, indeed we may access a lot
> of small files in virtiofs and suffer this performance drop.

Hi Jeffle,

If you access a lot of big files, or a file bigger than the dax window,
you will still face a performance drop due to reclaim. IOW, if the data
being accessed is bigger than the dax window, then reclaim will trigger
and a performance drop will be observed. So I think it's not fair to
associate the performance drop with big or small files as such.

What makes more sense is the memory usage argument you have used later
in the email. That is, we have a fixed chunk size of 2MB, which means we
use 512 * 64 = 32K of memory per chunk. So if a file is smaller than 32K
in size, it might be better to just access it without DAX and incur the
cost of page cache in the guest instead. Even this argument works only
if the dax window is being fully utilized.
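
The break-even point can be sketched as below. This is a deliberately simplified model that compares only the page descriptor cost of the DAX chunks with the page cache a non-DAX file would occupy; it uses the 2MB chunk, 4K page and 64-byte descriptor figures from this thread, and the function names are illustrative.

```python
PAGE_SIZE = 4096
STRUCT_PAGE_SIZE = 64
CHUNK_SIZE = 2 * 1024 * 1024

def ceil_div(a, b):
    return -(-a // b)

def dax_descriptor_bytes(file_size):
    """Page descriptor memory for the chunks needed to map the file with DAX."""
    chunks = ceil_div(file_size, CHUNK_SIZE)
    return chunks * (CHUNK_SIZE // PAGE_SIZE) * STRUCT_PAGE_SIZE

def page_cache_bytes(file_size):
    """Guest page cache needed to hold the whole file without DAX."""
    return ceil_div(file_size, PAGE_SIZE) * PAGE_SIZE

for size in (16 * 1024, 32 * 1024, 1024 * 1024):
    print(size, dax_descriptor_bytes(size), page_cache_bytes(size))
```

For a 16K file the DAX chunk costs 32K of descriptors against 16K of page cache, the two are equal at 32K, and for a 1M file DAX is far cheaper (32K against 1M).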

Anyway, I think Miklos already asked you to send patches so that the
virtiofs daemon specifies which files to use dax on. So are you
planning to post patches again for that? (And drop the patches that
read the per-inode dax attr from the filesystem in the guest.)

Thanks
Vivek

> 
> 
> > In that case probably we should just make DAX window larger. I assume
> 
> Yes, as the DAX window gets larger, it is less likely that we can run
> short of dax window resource.
> 
> However it doesn't come without cost. 'struct page' descriptor for dax
> window will consume guest memory at a ratio of ~1.5% (64/4096 = ~1.5%,
> page descriptor is of 64 bytes size, assuming 4K sized page). That is,
> every 1GB cache size will cost 16MB guest memory. As the cache size
> increases, the memory footprint for page descriptors also increases,
> which may offset the benefit of dax by eliminating guest page cache.
> 
> In summary, per-file dax feature tries to achieve a balance between
> performance and memory overhead, by offering a finer-grained control for
> dax to users.
> 
> 
> > that selecting which files to turn DAX on will itself not be
> > trivial. Not sure what heuristics are being deployed to determine
> > that. Would like to know more about it.
> 
> Currently we enable dax for hot and large blob files, while disabling
> dax for other miscellaneous small files.
> 
> 
> 
> -- 
> Thanks,
> Jeffle
>
Jingbo Xu Sept. 22, 2021, 8:16 a.m. UTC | #25
Thanks for the reply and the suggestions. ;)


On 9/20/21 3:45 AM, Vivek Goyal wrote:
> On Thu, Sep 16, 2021 at 04:21:59PM +0800, JeffleXu wrote:
>> Hi, I add some performance statistics below.
>>
>>
>> On 8/17/21 8:40 PM, Vivek Goyal wrote:
>>> On Tue, Aug 17, 2021 at 10:32:14AM +0100, Dr. David Alan Gilbert wrote:
>>>> * Miklos Szeredi (miklos@szeredi.hu) wrote:
>>>>> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>
>>>>>> This patchset adds support of per-file DAX for virtiofs, which is
>>>>>> inspired by Ira Weiny's work on ext4[1] and xfs[2].
>>>>>
>>>>> Can you please explain the background of this change in detail?
>>>>>
>>>>> Why would an admin want to enable DAX for a particular virtiofs file
>>>>> and not for others?
>>>>
>>>> Where we're contending on virtiofs dax cache size it makes a lot of
>>>> sense; it's quite expensive for us to map something into the cache
>>>> (especially if we push something else out), so selectively DAXing files
>>>> that are expected to be hot could help reduce cache churn.
>>
>> Yes, the performance of dax can be limited when the DAX window is
>> limited, where dax window may be contended by multiple files.
>>
>> I tested kernel compiling in virtiofs, emulating the scenario where a
>> lot of files contending dax window and triggering dax window reclaiming.
>>
>> Environment setup:
>> - guest vCPU: 16
>> - time make vmlinux -j128
>>
>> type    | cache  | cache-size | time
>> ------- | ------ | ---------- | ----
>> non-dax | always |   --       | real 2m48.119s
>> dax     | always | 64M        | real 4m49.563s
>> dax     | always |   1G       | real 3m14.200s
>> dax     | always |   4G       | real 2m41.141s
>>
>>
>> It can be seen that there's performance drop, comparing to the normal
>> buffered IO, when dax window resource is restricted and dax window
>> reclaiming is triggered. The smaller the cache size is, the worse the
>> performance is. The performance drop can be alleviated and eliminated as
>> cache size increases.
>>
>> Though we may not compile kernel in virtiofs, indeed we may access a lot
>> of small files in virtiofs and suffer this performance drop.
> 
> Hi Jeffle,
> 
> If you access lot of big files or a file bigger than dax window, still
> you will face performance drop due to reclaim. IOW, if data being
> accessed is bigger than dax window, then reclaim will trigger and
> performance drop will be observed. So I think it's not fair to associate
> performance drop with big or small files as such.

Yes, that's right. What I actually mean is that small files (smaller
than the dax window chunk size) are likely to consume more dax window
chunks than large files for the same total file size.
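
A rough illustration of that point, assuming a 2MB chunk and that a chunk maps a range of a single file only (so every file, however small, needs at least one chunk while it is mapped). The file sizes are made-up examples.

```python
CHUNK_SIZE = 2 * 1024 * 1024
KiB = 1024
MiB = 1024 * KiB

def chunks_needed(file_sizes):
    """Number of dax window chunks needed to map every file completely."""
    return sum(-(-size // CHUNK_SIZE) for size in file_sizes)

total = 64 * MiB
small_files = [16 * KiB] * (total // (16 * KiB))  # 4096 files of 16K each
large_file = [total]                              # one 64M file

print(chunks_needed(small_files))  # 4096
print(chunks_needed(large_file))   # 32
```

The same 64M of data needs 4096 chunks (8G of dax window) as 16K files, but only 32 chunks as one large file.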


> 
> What makes more sense is that memory usage argument you have used
> later in the email. That is, we have a fixed chunk size of 2MB. And
> that means we use 512 * 64 = 32K of memory per chunk. So if a file
> is smaller than 32K in size, it might be better to just access it
> without DAX and incur the cost of page cache in guest instead. Even this
> argument also works only if dax window is being utilized fully.

Yes, agreed. In this case, the point of per-file dax is that the admin
can keep the overall dax window size under a limit while still
sustaining reasonable performance. At the very least, users are able to
tune it now.

> 
> Anyway, I think Miklos already asked you to send patches so that
> virtiofs daemon specifies which file to use dax on. So are you
> planning to post patches again for that. (And drop patches to
> read dax attr from per inode from filesystem in guest).

OK. I will send a new version that disables dax based on the file size
on the host daemon side. Besides, I'm afraid the negotiation phase is
not needed anymore either, since the hint whether dax shall be enabled
is now fed entirely by the host daemon, and the guest side no longer
needs to set/clear the per-inode dax attr.