
[00/20] virtiofs: Add DAX support

Message ID 20200304165845.3081-1-vgoyal@redhat.com

Message

Vivek Goyal March 4, 2020, 4:58 p.m. UTC
Hi,

This patch series adds DAX support to the virtiofs filesystem. It allows
bypassing the guest page cache and mapping the host page cache directly
into the guest address space.

When a page of a file is needed, the guest sends a request to map that
page (from the host page cache) into the qemu address space. Inside the
guest this shows up as a physical memory range controlled by the virtiofs
device. The guest directly maps this physical address range using DAX and
hence gets access to the file data on the host.

This can speed things up considerably in many situations. It can also
result in substantial memory savings, as file data does not have to be
copied into the guest and is instead accessed directly from the host
page cache.

Most of the changes are limited to fuse/virtiofs. A couple of changes are
needed in the generic dax infrastructure and a couple in virtio to be able
to access the shared memory region.
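
As an illustration of the virtio side, here is a minimal sketch, assuming
the get_shm_region() interface and region id introduced by this series
(virtio_get_shm_region(), struct virtio_shm_region, VIRTIO_FS_SHMCAP_ID_CACHE),
of how the driver can look up the shared memory region that backs the DAX
window:

/*
 * Illustrative sketch only, not the exact patch code: ask the virtio
 * device for the shared memory region that backs the DAX cache window.
 */
#include <linux/virtio.h>
#include <linux/virtio_config.h>
#include <uapi/linux/virtio_fs.h>

static int virtio_fs_query_dax_window(struct virtio_device *vdev,
                                      phys_addr_t *start, size_t *len)
{
        struct virtio_shm_region cache_reg;

        /* Returns false if the device does not advertise a cache region. */
        if (!virtio_get_shm_region(vdev, &cache_reg,
                                   (u8)VIRTIO_FS_SHMCAP_ID_CACHE))
                return -ENODEV;

        /*
         * cache_reg.addr/cache_reg.len describe the guest-physical range
         * into which the host maps file contents; fuse later maps this
         * range into inodes via DAX instead of copying data into the
         * guest page cache.
         */
        *start = (phys_addr_t)cache_reg.addr;
        *len   = (size_t)cache_reg.len;
        return 0;
}

The series then sets up a dax_device on top of this range (see the
"virtio_fs, dax: Set up virtio_fs dax_device" patch below).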

These patches apply on top of 5.6-rc4 and are also available here.

https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020

Any review or feedback is welcome.

Performance
===========
I ran a bunch of fio jobs to get a sense of the speed of various
operations. I wrote a simple wrapper script that runs each fio job
3 times and reports the average. These scripts and fio jobs are
available here.

https://github.com/rhvgoyal/virtiofs-tests

I set up a directory on ramfs on the host, exported that directory into
the guest using virtio-fs, and ran the tests inside the guest. Tests were
run with cache=none, both with dax enabled and disabled. The cache=none
option enforces that nothing is cached in the guest, for either data or
metadata.

Test Setup
-----------
- A Fedora 29 host with 376GiB of RAM, 2 sockets (20 cores per socket, 2
  threads per core)

- Using ramfs on the host as the backing store, with 4 fio files of 8G each.

- A VM with 64 VCPUs and 64GB of memory, plus a 64GB cache window (for dax
  mmap).

Test Results
------------
- Results have been reported for two configurations:
  virtio-fs (cache=none) and virtio-fs (cache=none + dax).

  There are other caching modes as well, but cache=none seemed the most
  interesting for now because it does not cache anything in the guest
  and provides strong coherence. Other modes, which provide weaker
  coherence and hence are faster, are yet to be benchmarked.

- Three fio ioengines psync, libaio and mmap have been used.

- I/O workloads of randread, randwrite, seqread and seqwrite have been run.

- Each file is 8G in size. Block size is 4K and iodepth=16.

- "multi" means same operation was done with 4 jobs and each job is
  operating on a file of size 8G. 

- Some results are "0 (KiB/s)". That means that particular operation is
  not supported in that configuration.

NAME                    I/O Operation           BW(Read/Write)
virtiofs-cache-none     seqread-psync           35(MiB/s)
virtiofs-cache-none-dax seqread-psync           643(MiB/s)

virtiofs-cache-none     seqread-psync-multi     219(MiB/s)
virtiofs-cache-none-dax seqread-psync-multi     2132(MiB/s)

virtiofs-cache-none     seqread-mmap            0(KiB/s)
virtiofs-cache-none-dax seqread-mmap            741(MiB/s)

virtiofs-cache-none     seqread-mmap-multi      0(KiB/s)
virtiofs-cache-none-dax seqread-mmap-multi      2530(MiB/s)

virtiofs-cache-none     seqread-libaio          293(MiB/s)
virtiofs-cache-none-dax seqread-libaio          425(MiB/s)

virtiofs-cache-none     seqread-libaio-multi    207(MiB/s)
virtiofs-cache-none-dax seqread-libaio-multi    1543(MiB/s)

virtiofs-cache-none     randread-psync          36(MiB/s)
virtiofs-cache-none-dax randread-psync          572(MiB/s)

virtiofs-cache-none     randread-psync-multi    211(MiB/s)
virtiofs-cache-none-dax randread-psync-multi    1764(MiB/s)

virtiofs-cache-none     randread-mmap           0(KiB/s)
virtiofs-cache-none-dax randread-mmap           719(MiB/s)

virtiofs-cache-none     randread-mmap-multi     0(KiB/s)
virtiofs-cache-none-dax randread-mmap-multi     2005(MiB/s)

virtiofs-cache-none     randread-libaio         300(MiB/s)
virtiofs-cache-none-dax randread-libaio         413(MiB/s)

virtiofs-cache-none     randread-libaio-multi   327(MiB/s)
virtiofs-cache-none-dax randread-libaio-multi   1326(MiB/s)

virtiofs-cache-none     seqwrite-psync          34(MiB/s)
virtiofs-cache-none-dax seqwrite-psync          494(MiB/s)

virtiofs-cache-none     seqwrite-psync-multi    223(MiB/s)
virtiofs-cache-none-dax seqwrite-psync-multi    1680(MiB/s)

virtiofs-cache-none     seqwrite-mmap           0(KiB/s)
virtiofs-cache-none-dax seqwrite-mmap           1217(MiB/s)

virtiofs-cache-none     seqwrite-mmap-multi     0(KiB/s)
virtiofs-cache-none-dax seqwrite-mmap-multi     2359(MiB/s)

virtiofs-cache-none     seqwrite-libaio         282(MiB/s)
virtiofs-cache-none-dax seqwrite-libaio         348(MiB/s)

virtiofs-cache-none     seqwrite-libaio-multi   320(MiB/s)
virtiofs-cache-none-dax seqwrite-libaio-multi   1255(MiB/s)

virtiofs-cache-none     randwrite-psync         32(MiB/s)
virtiofs-cache-none-dax randwrite-psync         458(MiB/s)

virtiofs-cache-none     randwrite-psync-multi   213(MiB/s)
virtiofs-cache-none-dax randwrite-psync-multi   1343(MiB/s)

virtiofs-cache-none     randwrite-mmap          0(KiB/s)
virtiofs-cache-none-dax randwrite-mmap          663(MiB/s)

virtiofs-cache-none     randwrite-mmap-multi    0(KiB/s)
virtiofs-cache-none-dax randwrite-mmap-multi    1820(MiB/s)

virtiofs-cache-none     randwrite-libaio        292(MiB/s)
virtiofs-cache-none-dax randwrite-libaio        341(MiB/s)

virtiofs-cache-none     randwrite-libaio-multi  322(MiB/s)
virtiofs-cache-none-dax randwrite-libaio-multi  1094(MiB/s)

Conclusion
===========
- virtio-fs with dax enabled is significantly faster and more memory
  efficient compared to non-dax operation.

Note:
  Right now the dax window is 64G and the total fio data set is 32G (4
  files of 8G each). That means everything fits into the dax window and no
  reclaim is needed. The dax window reclaim logic is slower, and if the
  data set is bigger than the dax window, performance drops.

Thanks
Vivek

Sebastien Boeuf (3):
  virtio: Add get_shm_region method
  virtio: Implement get_shm_region for PCI transport
  virtio: Implement get_shm_region for MMIO transport

Stefan Hajnoczi (2):
  virtio_fs, dax: Set up virtio_fs dax_device
  fuse,dax: add DAX mmap support

Vivek Goyal (15):
  dax: Modify bdev_dax_pgoff() to handle NULL bdev
  dax: Create a range version of dax_layout_busy_page()
  virtiofs: Provide a helper function for virtqueue initialization
  fuse: Get rid of no_mount_options
  fuse,virtiofs: Add a mount option to enable dax
  fuse,virtiofs: Keep a list of free dax memory ranges
  fuse: implement FUSE_INIT map_alignment field
  fuse: Introduce setupmapping/removemapping commands
  fuse, dax: Implement dax read/write operations
  fuse, dax: Take ->i_mmap_sem lock during dax page fault
  fuse,virtiofs: Define dax address space operations
  fuse,virtiofs: Maintain a list of busy elements
  fuse: Release file in process context
  fuse: Take inode lock for dax inode truncation
  fuse,virtiofs: Add logic to free up a memory range

 drivers/dax/super.c                |    3 +-
 drivers/virtio/virtio_mmio.c       |   32 +
 drivers/virtio/virtio_pci_modern.c |  107 +++
 fs/dax.c                           |   66 +-
 fs/fuse/dir.c                      |    2 +
 fs/fuse/file.c                     | 1162 +++++++++++++++++++++++++++-
 fs/fuse/fuse_i.h                   |  109 ++-
 fs/fuse/inode.c                    |  148 +++-
 fs/fuse/virtio_fs.c                |  250 +++++-
 include/linux/dax.h                |    6 +
 include/linux/virtio_config.h      |   17 +
 include/uapi/linux/fuse.h          |   42 +-
 include/uapi/linux/virtio_fs.h     |    3 +
 include/uapi/linux/virtio_mmio.h   |   11 +
 include/uapi/linux/virtio_pci.h    |   11 +-
 15 files changed, 1888 insertions(+), 81 deletions(-)

Comments

Amir Goldstein March 11, 2020, 5:22 a.m. UTC | #1
On Wed, Mar 4, 2020 at 7:01 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> Hi,
>
> This patch series adds DAX support to virtiofs filesystem. This allows
> bypassing guest page cache and allows mapping host page cache directly
> in guest address space.
>
> When a page of file is needed, guest sends a request to map that page
> (in host page cache) in qemu address space. Inside guest this is
> a physical memory range controlled by virtiofs device. And guest
> directly maps this physical address range using DAX and hence gets
> access to file data on host.
>
> This can speed up things considerably in many situations. Also this
> can result in substantial memory savings as file data does not have
> to be copied in guest and it is directly accessed from host page
> cache.
>
> Most of the changes are limited to fuse/virtiofs. There are couple
> of changes needed in generic dax infrastructure and couple of changes
> in virtio to be able to access shared memory region.
>
> These patches apply on top of 5.6-rc4 and are also available here.
>
> https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020
>
> Any review or feedback is welcome.
>
[...]
>  drivers/dax/super.c                |    3 +-
>  drivers/virtio/virtio_mmio.c       |   32 +
>  drivers/virtio/virtio_pci_modern.c |  107 +++
>  fs/dax.c                           |   66 +-
>  fs/fuse/dir.c                      |    2 +
>  fs/fuse/file.c                     | 1162 +++++++++++++++++++++++++++-

That's a big addition to an already big file.c.
Maybe split the dax-specific code into dax.c?
This can be a post-series cleanup too.

Thanks,
Amir.
Vivek Goyal March 11, 2020, 1:09 p.m. UTC | #2
On Wed, Mar 11, 2020 at 07:22:51AM +0200, Amir Goldstein wrote:
> On Wed, Mar 4, 2020 at 7:01 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > Hi,
> >
> > This patch series adds DAX support to virtiofs filesystem. This allows
> > bypassing guest page cache and allows mapping host page cache directly
> > in guest address space.
> >
> > When a page of file is needed, guest sends a request to map that page
> > (in host page cache) in qemu address space. Inside guest this is
> > a physical memory range controlled by virtiofs device. And guest
> > directly maps this physical address range using DAX and hence gets
> > access to file data on host.
> >
> > This can speed up things considerably in many situations. Also this
> > can result in substantial memory savings as file data does not have
> > to be copied in guest and it is directly accessed from host page
> > cache.
> >
> > Most of the changes are limited to fuse/virtiofs. There are couple
> > of changes needed in generic dax infrastructure and couple of changes
> > in virtio to be able to access shared memory region.
> >
> > These patches apply on top of 5.6-rc4 and are also available here.
> >
> > https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020
> >
> > Any review or feedback is welcome.
> >
> [...]
> >  drivers/dax/super.c                |    3 +-
> >  drivers/virtio/virtio_mmio.c       |   32 +
> >  drivers/virtio/virtio_pci_modern.c |  107 +++
> >  fs/dax.c                           |   66 +-
> >  fs/fuse/dir.c                      |    2 +
> >  fs/fuse/file.c                     | 1162 +++++++++++++++++++++++++++-
> 
> That's a big addition to already big file.c.
> Maybe split dax specific code to dax.c?
> Can be a post series cleanup too.

A lot of this code comes from the logic to reclaim dax memory ranges
assigned to inodes. I will look into moving some of it to a separate
file.

Vivek
Patrick Ohly March 11, 2020, 1:38 p.m. UTC | #3
Vivek Goyal <vgoyal@redhat.com> writes:
> This patch series adds DAX support to virtiofs filesystem. This allows
> bypassing guest page cache and allows mapping host page cache directly
> in guest address space.
>
> When a page of file is needed, guest sends a request to map that page
> (in host page cache) in qemu address space. Inside guest this is
> a physical memory range controlled by virtiofs device. And guest
> directly maps this physical address range using DAX and hence gets
> access to file data on host.
>
> This can speed up things considerably in many situations. Also this
> can result in substantial memory savings as file data does not have
> to be copied in guest and it is directly accessed from host page
> cache.

As a potential user of this, let me make sure I understand the expected
outcome: is the goal to let virtiofs use DAX (for increased performance,
etc.) or also let applications that use virtiofs use DAX?

You are mentioning using the host's page cache, so it's probably the
former and MAP_SYNC on virtiofs will continue to be rejected, right?
Vivek Goyal March 11, 2020, 6:48 p.m. UTC | #4
On Wed, Mar 11, 2020 at 07:22:51AM +0200, Amir Goldstein wrote:
> On Wed, Mar 4, 2020 at 7:01 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > Hi,
> >
> > This patch series adds DAX support to virtiofs filesystem. This allows
> > bypassing guest page cache and allows mapping host page cache directly
> > in guest address space.
> >
> > When a page of file is needed, guest sends a request to map that page
> > (in host page cache) in qemu address space. Inside guest this is
> > a physical memory range controlled by virtiofs device. And guest
> > directly maps this physical address range using DAX and hence gets
> > access to file data on host.
> >
> > This can speed up things considerably in many situations. Also this
> > can result in substantial memory savings as file data does not have
> > to be copied in guest and it is directly accessed from host page
> > cache.
> >
> > Most of the changes are limited to fuse/virtiofs. There are couple
> > of changes needed in generic dax infrastructure and couple of changes
> > in virtio to be able to access shared memory region.
> >
> > These patches apply on top of 5.6-rc4 and are also available here.
> >
> > https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020
> >
> > Any review or feedback is welcome.
> >
> [...]
> >  drivers/dax/super.c                |    3 +-
> >  drivers/virtio/virtio_mmio.c       |   32 +
> >  drivers/virtio/virtio_pci_modern.c |  107 +++
> >  fs/dax.c                           |   66 +-
> >  fs/fuse/dir.c                      |    2 +
> >  fs/fuse/file.c                     | 1162 +++++++++++++++++++++++++++-
> 
> That's a big addition to already big file.c.
> Maybe split dax specific code to dax.c?
> Can be a post series cleanup too.

How about fs/fuse/iomap.c instead? It would hold all the iomap-related
logic as well as the dax range allocation/free logic that the iomap code
requires. That moves about 900 lines of code from file.c to iomap.c.

Vivek
Amir Goldstein March 11, 2020, 7:32 p.m. UTC | #5
On Wed, Mar 11, 2020 at 8:48 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Wed, Mar 11, 2020 at 07:22:51AM +0200, Amir Goldstein wrote:
> > On Wed, Mar 4, 2020 at 7:01 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > Hi,
> > >
> > > This patch series adds DAX support to virtiofs filesystem. This allows
> > > bypassing guest page cache and allows mapping host page cache directly
> > > in guest address space.
> > >
> > > When a page of file is needed, guest sends a request to map that page
> > > (in host page cache) in qemu address space. Inside guest this is
> > > a physical memory range controlled by virtiofs device. And guest
> > > directly maps this physical address range using DAX and hence gets
> > > access to file data on host.
> > >
> > > This can speed up things considerably in many situations. Also this
> > > can result in substantial memory savings as file data does not have
> > > to be copied in guest and it is directly accessed from host page
> > > cache.
> > >
> > > Most of the changes are limited to fuse/virtiofs. There are couple
> > > of changes needed in generic dax infrastructure and couple of changes
> > > in virtio to be able to access shared memory region.
> > >
> > > These patches apply on top of 5.6-rc4 and are also available here.
> > >
> > > https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020
> > >
> > > Any review or feedback is welcome.
> > >
> > [...]
> > >  drivers/dax/super.c                |    3 +-
> > >  drivers/virtio/virtio_mmio.c       |   32 +
> > >  drivers/virtio/virtio_pci_modern.c |  107 +++
> > >  fs/dax.c                           |   66 +-
> > >  fs/fuse/dir.c                      |    2 +
> > >  fs/fuse/file.c                     | 1162 +++++++++++++++++++++++++++-
> >
> > That's a big addition to already big file.c.
> > Maybe split dax specific code to dax.c?
> > Can be a post series cleanup too.
>
> How about fs/fuse/iomap.c instead. This will have all the iomap related logic
> as well as all the dax range allocation/free logic which is required
> by iomap logic. That moves about 900 lines of code from file.c to iomap.c
>

Fine by me. I didn't take the time to study the code in file.c;
I just noticed it has grown a lot bigger and wasn't sure that
made sense. Up to you, and only if you think the result would be
nicer to maintain.

Thanks,
Amir.
Vivek Goyal March 11, 2020, 7:39 p.m. UTC | #6
On Wed, Mar 11, 2020 at 09:32:17PM +0200, Amir Goldstein wrote:
> On Wed, Mar 11, 2020 at 8:48 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Wed, Mar 11, 2020 at 07:22:51AM +0200, Amir Goldstein wrote:
> > > On Wed, Mar 4, 2020 at 7:01 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > This patch series adds DAX support to virtiofs filesystem. This allows
> > > > bypassing guest page cache and allows mapping host page cache directly
> > > > in guest address space.
> > > >
> > > > When a page of file is needed, guest sends a request to map that page
> > > > (in host page cache) in qemu address space. Inside guest this is
> > > > a physical memory range controlled by virtiofs device. And guest
> > > > directly maps this physical address range using DAX and hence gets
> > > > access to file data on host.
> > > >
> > > > This can speed up things considerably in many situations. Also this
> > > > can result in substantial memory savings as file data does not have
> > > > to be copied in guest and it is directly accessed from host page
> > > > cache.
> > > >
> > > > Most of the changes are limited to fuse/virtiofs. There are couple
> > > > of changes needed in generic dax infrastructure and couple of changes
> > > > in virtio to be able to access shared memory region.
> > > >
> > > > These patches apply on top of 5.6-rc4 and are also available here.
> > > >
> > > > https://github.com/rhvgoyal/linux/commits/vivek-04-march-2020
> > > >
> > > > Any review or feedback is welcome.
> > > >
> > > [...]
> > > >  drivers/dax/super.c                |    3 +-
> > > >  drivers/virtio/virtio_mmio.c       |   32 +
> > > >  drivers/virtio/virtio_pci_modern.c |  107 +++
> > > >  fs/dax.c                           |   66 +-
> > > >  fs/fuse/dir.c                      |    2 +
> > > >  fs/fuse/file.c                     | 1162 +++++++++++++++++++++++++++-
> > >
> > > That's a big addition to already big file.c.
> > > Maybe split dax specific code to dax.c?
> > > Can be a post series cleanup too.
> >
> > How about fs/fuse/iomap.c instead. This will have all the iomap related logic
> > as well as all the dax range allocation/free logic which is required
> > by iomap logic. That moves about 900 lines of code from file.c to iomap.c
> >
> 
> Fine by me. I didn't take time to study the code in file.c
> I just noticed is has grown a lot bigger and wasn't sure that
> it made sense. Up to you. Only if you think the result would be nicer
> to maintain.

I am happy to move this code to a separate file. In fact, I think we could
probably break it up further into another file, say dax-mapping.c or
something like that, where all the memory range allocation/reclaim logic
goes while the iomap logic remains in iomap.c.

But that's probably a future cleanup, if the code in this file continues
to grow.

Vivek
Vivek Goyal March 16, 2020, 1:02 p.m. UTC | #7
On Wed, Mar 11, 2020 at 02:38:03PM +0100, Patrick Ohly wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> > This patch series adds DAX support to virtiofs filesystem. This allows
> > bypassing guest page cache and allows mapping host page cache directly
> > in guest address space.
> >
> > When a page of file is needed, guest sends a request to map that page
> > (in host page cache) in qemu address space. Inside guest this is
> > a physical memory range controlled by virtiofs device. And guest
> > directly maps this physical address range using DAX and hence gets
> > access to file data on host.
> >
> > This can speed up things considerably in many situations. Also this
> > can result in substantial memory savings as file data does not have
> > to be copied in guest and it is directly accessed from host page
> > cache.
> 
> As a potential user of this, let me make sure I understand the expected
> outcome: is the goal to let virtiofs use DAX (for increased performance,
> etc.) or also let applications that use virtiofs use DAX?
> 
> You are mentioning using the host's page cache, so it's probably the
> former and MAP_SYNC on virtiofs will continue to be rejected, right?

Hi Patrick,

You are right, it's the former. That is, we want virtiofs to be able to
make use of DAX to bypass the guest page cache. But there is no persistent
memory, so no persistent memory programming semantics are available to
user space. For that, I guess, we have virtio-pmem.

We expect users to issue fsync/msync as on a regular filesystem to
make changes persistent. So in that respect, rejecting MAP_SYNC
makes sense. I will test and see whether the current code rejects
MAP_SYNC or not.

Thanks
Vivek
Patrick Ohly March 17, 2020, 8:28 a.m. UTC | #8
Vivek Goyal <vgoyal@redhat.com> writes:
> We expect users will issue fsync/msync like a regular filesystem to
> make changes persistent. So in that aspect, rejecting MAP_SYNC
> makes sense. I will test and see if current code is rejecting MAP_SYNC
> or not.

Last time I checked, it did. Here's the test program that I wrote for
that:
https://github.com/intel/pmem-csi/blob/ee3200794a1ade49a02df6f359a134115b409e90/test/cmd/pmem-dax-check/main.go
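
For reference, a minimal C sketch of the same check (an illustrative probe,
assuming the standard Linux MAP_SHARED_VALIDATE/MAP_SYNC uapi flags; the
kernel fails the mmap() with EOPNOTSUPP when MAP_SYNC cannot be honored):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallback definitions for older libcs; values match the Linux uapi. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

int main(int argc, char **argv)
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s <file on fs to test>\n", argv[0]);
                return 2;
        }

        /* The test file must be at least one page long. */
        int fd = open(argv[1], O_RDWR);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        if (p == MAP_FAILED)
                printf("MAP_SYNC rejected: %s\n", strerror(errno));
        else
                printf("MAP_SYNC accepted\n");

        close(fd);
        return 0;
}

Per the discussion above, the expectation is that this mmap() fails on a
virtiofs mount, since virtiofs-dax does not provide persistent memory
semantics.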