[PATCH 00/35] Enhance memory utilization with DMEMFS

Message ID: cover.1602093760.git.yuleixzhang@tencent.com

Message

yulei zhang Oct. 8, 2020, 7:53 a.m. UTC
From: Yulei Zhang <yuleixzhang@tencent.com>

In the current system each physical memory page is associated with
a page structure that tracks how the page is used. But as memory
usage grows rapidly in cloud environments, we find that the memory
consumed by page structures becomes significant. So is it an
expense that we could spare?

This patchset introduces a way to save that extra memory through a
new virtual filesystem -- dmemfs.

Dmemfs (Direct Memory filesystem) is a filesystem backed by device
memory or reserved memory. This kind of memory is special: it is not
managed by the kernel and, most importantly, it has no 'struct page'.
Therefore we can leverage the extra memory on the host system to
support more tenants in our cloud service.

We use a kernel boot parameter, 'dmem=', to reserve system memory
when the host system boots up; the details can be found in
Documentation/admin-guide/kernel-parameters.txt.
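
For illustration, a host might boot with something like (hypothetical
value; see the documentation above for the exact accepted syntax):

	dmem=4G

to set aside 4G of system memory for dmemfs to manage.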

Theoretically, dropping the 'struct page' saves 64 bytes for each 4K
physical page, so for 320G of guest memory it can save about 5G of
physical memory in total.
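
For reference, the arithmetic behind that estimate:

	64 bytes / 4096 bytes = 1/64 of memory spent on 'struct page'
	320G / 64 = 5G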

Detailed usage of dmemfs is described in
Documentation/filesystems/dmemfs.rst.

TODO:
1. We temporarily disable record_steal_time() before entering the
guest; we will re-enable it after resolving the conflict.
2. We are working on system calls such as mincore, and will post
status updates and patches soon.

Yulei Zhang (35):
  fs: introduce dmemfs module
  mm: support direct memory reservation
  dmem: implement dmem memory management
  dmem: let pat recognize dmem
  dmemfs: support mmap
  dmemfs: support truncating inode down
  dmem: trace core functions
  dmem: show some statistic in debugfs
  dmemfs: support remote access
  dmemfs: introduce max_alloc_try_dpages parameter
  mm: export mempolicy interfaces to serve dmem allocator
  dmem: introduce mempolicy support
  mm, dmem: introduce PFN_DMEM and pfn_t_dmem
  mm, dmem: dmem-pmd vs thp-pmd
  mm: add pmd_special() check for pmd_trans_huge_lock()
  dmemfs: introduce ->split() to dmemfs_vm_ops
  mm, dmemfs: support unmap_page_range() for dmemfs pmd
  mm: follow_pmd_mask() for dmem huge pmd
  mm: gup_huge_pmd() for dmem huge pmd
  mm: support dmem huge pmd for vmf_insert_pfn_pmd()
  mm: support dmem huge pmd for follow_pfn()
  kvm, x86: Distinguish dmemfs page from mmio page
  kvm, x86: introduce VM_DMEM
  dmemfs: support hugepage for dmemfs
  mm, x86, dmem: fix estimation of reserved page for vaddr_get_pfn()
  mm, dmem: introduce pud_special()
  mm: add pud_special() to support dmem huge pud
  mm, dmemfs: support huge_fault() for dmemfs
  mm: add follow_pte_pud()
  dmem: introduce dmem_bitmap_alloc() and dmem_bitmap_free()
  dmem: introduce mce handler
  mm, dmemfs: register and handle the dmem mce
  kvm, x86: temporary disable record_steal_time for dmem
  dmem: add dmem unit tests
  Add documentation for dmemfs

 .../admin-guide/kernel-parameters.txt         |   38 +
 Documentation/filesystems/dmemfs.rst          |   59 +
 arch/x86/Kconfig                              |    1 +
 arch/x86/include/asm/pgtable.h                |   32 +-
 arch/x86/include/asm/pgtable_types.h          |   13 +-
 arch/x86/kernel/setup.c                       |    3 +
 arch/x86/kvm/mmu/mmu.c                        |    5 +-
 arch/x86/kvm/x86.c                            |    2 +
 arch/x86/mm/pat/memtype.c                     |   21 +
 drivers/vfio/vfio_iommu_type1.c               |    4 +
 fs/Kconfig                                    |    1 +
 fs/Makefile                                   |    1 +
 fs/dmemfs/Kconfig                             |   16 +
 fs/dmemfs/Makefile                            |    8 +
 fs/dmemfs/inode.c                             | 1063 ++++++++++++++++
 fs/dmemfs/trace.h                             |   54 +
 fs/inode.c                                    |    6 +
 include/linux/dmem.h                          |   49 +
 include/linux/fs.h                            |    1 +
 include/linux/huge_mm.h                       |    5 +-
 include/linux/mempolicy.h                     |    3 +
 include/linux/mm.h                            |    9 +
 include/linux/pfn_t.h                         |   17 +-
 include/linux/pgtable.h                       |   22 +
 include/trace/events/dmem.h                   |   85 ++
 include/uapi/linux/magic.h                    |    1 +
 mm/Kconfig                                    |   21 +
 mm/Makefile                                   |    1 +
 mm/dmem.c                                     | 1075 +++++++++++++++++
 mm/dmem_reserve.c                             |  303 +++++
 mm/gup.c                                      |   94 +-
 mm/huge_memory.c                              |   19 +-
 mm/memory-failure.c                           |   69 +-
 mm/memory.c                                   |   74 +-
 mm/mempolicy.c                                |    4 +-
 mm/mprotect.c                                 |    7 +-
 mm/mremap.c                                   |    3 +
 tools/testing/dmem/Kbuild                     |    1 +
 tools/testing/dmem/Makefile                   |   10 +
 tools/testing/dmem/dmem-test.c                |  184 +++
 40 files changed, 3336 insertions(+), 48 deletions(-)
 create mode 100644 Documentation/filesystems/dmemfs.rst
 create mode 100644 fs/dmemfs/Kconfig
 create mode 100644 fs/dmemfs/Makefile
 create mode 100644 fs/dmemfs/inode.c
 create mode 100644 fs/dmemfs/trace.h
 create mode 100644 include/linux/dmem.h
 create mode 100644 include/trace/events/dmem.h
 create mode 100644 mm/dmem.c
 create mode 100644 mm/dmem_reserve.c
 create mode 100644 tools/testing/dmem/Kbuild
 create mode 100644 tools/testing/dmem/Makefile
 create mode 100644 tools/testing/dmem/dmem-test.c

Comments

Joao Martins Oct. 8, 2020, 7:01 p.m. UTC | #1
[adding a couple folks that directly or indirectly work on the subject]

On 10/8/20 8:53 AM, yulei.kernel@gmail.com wrote:
> From: Yulei Zhang <yuleixzhang@tencent.com>
> 
> In the current system each physical memory page is associated with
> a page structure that tracks how the page is used. But as memory
> usage grows rapidly in cloud environments, we find that the memory
> consumed by page structures becomes significant. So is it an
> expense that we could spare?
> 
Happy to see another person working to solve the same problem!

I am really glad to see more folks interested in solving this
problem, and I hope we can join efforts.

BTW, there is also a second benefit in removing struct page -
which is carving out memory from the direct map.

> This patchset introduces a way to save that extra memory through a
> new virtual filesystem -- dmemfs.
> 
> Dmemfs (Direct Memory filesystem) is a filesystem backed by device
> memory or reserved memory. This kind of memory is special: it is not
> managed by the kernel and, most importantly, it has no 'struct page'.
> Therefore we can leverage the extra memory on the host system to
> support more tenants in our cloud service.
> 
This is like a walk down memory lane.

About a year ago we followed the exact same idea/motivation to
have memory outside of the direct map (and to remove the struct page
overhead) and started with our own layer/thingie. However we realized
that DAX is one of the subsystems which already gives you direct
access to memory for free (and is already upstream), plus a couple
of things which we found more handy.

So we sent an RFC a couple months ago:

https://lore.kernel.org/linux-mm/20200110190313.17144-1-joao.m.martins@oracle.com/

Since then, the majority of the work has been in improving DAX [1].
Now that this is done, I am going to follow up with the above patchset.

[1]
https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.stgit@dwillia2-desk3.amr.corp.intel.com/

(Give me a couple of days and I will send you the link to the latest
patches on a git-tree - would love feedback!)

The struct page removal for DAX would then be small, and it ticks the
same boxes (MCE handling, reserving PAT memtypes, ptrace support)
that we both do, with a smaller diffstat, and it doesn't touch KVM
(at least not fundamentally):

	15 files changed, 401 insertions(+), 38 deletions(-)

What's needed in core-mm is handling PMD/PUD PAGE_SPECIAL, much as we
both do. Furthermore, there wouldn't be a need for a new vm type, for
consuming an extra page bit (in addition to PAGE_SPECIAL), or for a
new filesystem.

> We use a kernel boot parameter, 'dmem=', to reserve system memory
> when the host system boots up; the details can be found in
> Documentation/admin-guide/kernel-parameters.txt.
> 
> Theoretically, dropping the 'struct page' saves 64 bytes for each 4K
> physical page, so for 320G of guest memory it can save about 5G of
> physical memory in total.
> 
Also worth mentioning: if you only care about the 'struct page' cost, and not about the
security boundary, there's also some work on hugetlbfs preallocation of hugepages that
tricks the vmemmap into reusing tail pages.

  https://lore.kernel.org/linux-mm/20200915125947.26204-1-songmuchun@bytedance.com/

Going forward that could also make sense for device-dax, to avoid allocating so many
struct pages (which would require its transition to compound struct pages like
hugetlbfs, which we are looking at too). In addition, an idea <handwaving> would be
perhaps to have a stricter mode in DAX where we initialize/use the metadata
('struct page') but remove the underlying PFNs (of the 'struct page') from the
direct map, at the cost of mapping/unmapping on gup/pup.

	Joao
yulei zhang Oct. 9, 2020, 11:39 a.m. UTC | #2
Joao, thanks a lot for the feedback. One more thing worth mentioning
is that dmemfs also supports fine-grained memory management, which
makes it more flexible for tenants with different requirements.

On Fri, Oct 9, 2020 at 3:01 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> [adding a couple folks that directly or indirectly work on the subject]
>
> [...]
Joao Martins Oct. 9, 2020, 11:53 a.m. UTC | #3
On 10/9/20 12:39 PM, yulei zhang wrote:
> Joao, thanks a lot for the feedback. One more thing worth mentioning
> is that dmemfs also supports fine-grained memory management, which
> makes it more flexible for tenants with different requirements.
> 
So does DAX, when it allows partitioning a region (starting with 5.10). Meaning you have a region
which you dedicate to userspace. That region can then be partitioned into devices which
give you access to multiple (possibly discontinuous) extents at a given page
granularity (selectable when you create the device), accessed through mmap().
You can then give that device to a cgroup. Or you can return that memory back to the
kernel (should you run into an OOM situation), or you can recreate the same mappings across
reboot/kexec.

I probably need to read your patches again, but can you expand on 'dmemfs also supports
fine-grained memory management' so I understand what the gap is that you mention?

> [...]
yulei zhang Oct. 10, 2020, 8:15 a.m. UTC | #4
On Fri, Oct 9, 2020 at 7:53 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 10/9/20 12:39 PM, yulei zhang wrote:
> > Joao, thanks a lot for the feedback. One more thing worth mentioning
> > is that dmemfs also supports fine-grained memory management, which
> > makes it more flexible for tenants with different requirements.
> >
> So does DAX, when it allows partitioning a region (starting with 5.10). Meaning you have a region
> which you dedicate to userspace. That region can then be partitioned into devices which
> give you access to multiple (possibly discontinuous) extents at a given page
> granularity (selectable when you create the device), accessed through mmap().
> You can then give that device to a cgroup. Or you can return that memory back to the
> kernel (should you run into an OOM situation), or you can recreate the same mappings across
> reboot/kexec.
>
> I probably need to read your patches again, but can you expand on 'dmemfs also supports
> fine-grained memory management' so I understand what the gap is that you mention?
>

Sure. Dmemfs uses a bitmap to track memory usage in the reserved
memory region at a given page size granularity. And for each user the
memory can be discrete as well.
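
As a rough illustration of the idea -- hypothetical names and the
kernel bitmap API only, not the actual dmem code (see mm/dmem.c in
the series for that):

#include <linux/bitmap.h>
#include <linux/mm.h>

/* One bit per dmem page; dpage_shift selects the granularity. */
struct dmem_sketch {
	unsigned long	base_pfn;	/* first PFN of the reserved region */
	unsigned int	dpage_shift;	/* e.g. 21 for 2M dmem pages */
	unsigned long	nr_dpages;	/* dmem pages covered by the bitmap */
	unsigned long	*bitmap;	/* set bit = dmem page is in use */
};

/* Allocate one dmem page and return its PFN, or -1UL if exhausted.
 * Locking is omitted for brevity. */
static unsigned long dmem_sketch_alloc(struct dmem_sketch *ds)
{
	unsigned long pos = find_first_zero_bit(ds->bitmap, ds->nr_dpages);

	if (pos >= ds->nr_dpages)
		return -1UL;
	bitmap_set(ds->bitmap, pos, 1);
	return ds->base_pfn + (pos << (ds->dpage_shift - PAGE_SHIFT));
}

/* Free a previously allocated dmem page by PFN. */
static void dmem_sketch_free(struct dmem_sketch *ds, unsigned long pfn)
{
	unsigned long pos = (pfn - ds->base_pfn) >> (ds->dpage_shift - PAGE_SHIFT);

	bitmap_clear(ds->bitmap, pos, 1);
}

Because each user just sets or clears bits, the pages handed to one
tenant do not need to be physically contiguous, which is the
'discrete' property mentioned above.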

> [...]
Joao Martins Oct. 12, 2020, 10:59 a.m. UTC | #5
On 10/10/20 9:15 AM, yulei zhang wrote:
> On Fri, Oct 9, 2020 at 7:53 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 10/9/20 12:39 PM, yulei zhang wrote:
>>> Joao, thanks a lot for the feedback. One more thing worth mentioning
>>> is that dmemfs also supports fine-grained memory management, which
>>> makes it more flexible for tenants with different requirements.
>>>
>> So does DAX, when it allows partitioning a region (starting with 5.10). Meaning you have a region
>> which you dedicate to userspace. That region can then be partitioned into devices which
>> give you access to multiple (possibly discontinuous) extents at a given page
>> granularity (selectable when you create the device), accessed through mmap().
>> You can then give that device to a cgroup. Or you can return that memory back to the
>> kernel (should you run into an OOM situation), or you can recreate the same mappings across
>> reboot/kexec.
>>
>> I probably need to read your patches again, but can you expand on 'dmemfs also supports
>> fine-grained memory management' so I understand what the gap is that you mention?
>>
> Sure. Dmemfs uses a bitmap to track memory usage in the reserved
> memory region at a given page size granularity. And for each user the
> memory can be discrete as well.
> 
That same functionality of tracking reserved region usage across different users at any
page granularity is covered by the DAX series I mentioned earlier. The discrete part -- IIUC
what you meant -- is then handled by using the DAX ABI/tools to create a device file vs a filesystem.

>>> [...]
Zengtao (B) Oct. 12, 2020, 11:57 a.m. UTC | #6
> -----Original Message-----
> From: yulei.kernel@gmail.com [mailto:yulei.kernel@gmail.com]
> Sent: Thursday, October 08, 2020 3:54 PM
> To: akpm@linux-foundation.org; naoya.horiguchi@nec.com;
> viro@zeniv.linux.org.uk; pbonzini@redhat.com
> Cc: linux-fsdevel@vger.kernel.org; kvm@vger.kernel.org;
> linux-kernel@vger.kernel.org; xiaoguangrong.eric@gmail.com;
> kernellwp@gmail.com; lihaiwei.kernel@gmail.com; Yulei Zhang
> Subject: [PATCH 00/35] Enhance memory utilization with DMEMFS
> 
> [...]

Sounds interesting, but it seems your patch only supports x86; have
you considered aarch64?

Regards
Zengtao
yulei zhang Oct. 13, 2020, 2:45 a.m. UTC | #7
On Mon, Oct 12, 2020 at 7:57 PM Zengtao (B) <prime.zeng@hisilicon.com> wrote:
>
> [...]
>
> Sounds interesting, but it seems your patch only supports x86; have
> you considered aarch64?
>
> Regards
> Zengtao

Thanks. So far we have only verified it on x86 servers; we may extend
it to the ARM platform in the future.
Dan Williams Oct. 14, 2020, 10:25 p.m. UTC | #8
On Mon, Oct 12, 2020 at 4:00 AM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
> [...]
> >>
> >> I probably need to read your patches again, but can you expand on 'dmemfs also supports
> >> fine-grained memory management' so I understand what the gap is that you mention?
> >>
> > Sure. Dmemfs uses a bitmap to track memory usage in the reserved
> > memory region at a given page size granularity. And for each user the
> > memory can be discrete as well.
> >
> That same functionality of tracking reserved region usage across different users at any
> page granularity is covered the DAX series I mentioned below. The discrete part -- IIUC
> what you meant -- is then reduced using DAX ABI/tools to create a device file vs a filesystem.

Put another way: Linux already has a fine-grained memory management
system, the page allocator. Now, with recent device-dax extensions, it
also has a coarse-grained memory management system for physical
address-space partitioning and a path for struct-page-less backing for
VMs. What feature gaps remain vs. dmemfs, and can those gaps be closed
with incremental improvements to the two existing memory-management
systems?
Paolo Bonzini Oct. 19, 2020, 1:37 p.m. UTC | #9
On 15/10/20 00:25, Dan Williams wrote:
> Now, with recent device-dax extensions, it
> also has a coarse-grained memory management system for physical
> address-space partitioning and a path for struct-page-less backing for
> VMs. What feature gaps remain vs. dmemfs, and can those gaps be closed
> with incremental improvements to the two existing memory-management
> systems?

If I understand correctly, devm_memremap_pages() on ZONE_DEVICE memory
would still create the "struct page", albeit lazily?  KVM would then use
the usual get_user_pages() path.

Looking more closely at the implementation of dmemfs, I don't understand
why dmemfs needs VM_DMEM etc. and cannot provide access to mmap-ed
memory using remap_pfn_range and VM_PFNMAP, just like /dev/mem.  If it
did that, KVM would get physical addresses using fixup_user_fault and
would never need pfn_to_page() or get_user_pages().  I'm not saying that
would instantly be an approval, but it would remove a lot of hooks.
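
To make the suggestion concrete, here is a hedged sketch of such a
/dev/mem-style mmap path (dmemfs_pfn_of() is a hypothetical helper,
not an existing function):

#include <linux/fs.h>
#include <linux/mm.h>

static int dmemfs_sketch_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;
	/* Hypothetical: the file's base PFN plus the mmap offset. */
	unsigned long pfn = dmemfs_pfn_of(file) + vma->vm_pgoff;

	/*
	 * remap_pfn_range() marks the VMA VM_IO | VM_PFNMAP, so no
	 * struct page is ever created or referenced -- but it can only
	 * map a single physically contiguous PFN range, which is the
	 * limitation the follow-ups below point out.
	 */
	return remap_pfn_range(vma, vma->vm_start, pfn, size,
			       vma->vm_page_prot);
}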

Paolo
Joao Martins Oct. 19, 2020, 7:03 p.m. UTC | #10
On 10/19/20 2:37 PM, Paolo Bonzini wrote:
> On 15/10/20 00:25, Dan Williams wrote:
>> Now, with recent device-dax extensions, it
>> also has a coarse-grained memory management system for physical
>> address-space partitioning and a path for struct-page-less backing for
>> VMs. What feature gaps remain vs. dmemfs, and can those gaps be closed
>> with incremental improvements to the two existing memory-management
>> systems?
> 
> If I understand correctly, devm_memremap_pages() on ZONE_DEVICE memory
> would still create the "struct page", albeit lazily?  KVM would then use
> the usual get_user_pages() path.
> 
Correct.

The removal of struct page would be one of the added incremental improvements, like a
'map' sysfs attribute with a 'raw' value for dynamic dax regions that wouldn't
online/create the struct pages. The remaining plumbing (...)

> Looking more closely at the implementation of dmemfs, I don't understand
> why dmemfs needs VM_DMEM etc. and cannot provide access to mmap-ed
> memory using remap_pfn_range and VM_PFNMAP, just like /dev/mem.  If it
> did that, KVM would get physical addresses using fixup_user_fault and
> would never need pfn_to_page() or get_user_pages().  I'm not saying that
> would instantly be an approval, but it would remove a lot of hooks.
> 

(...) is similar to what you describe above. Albeit there's probably no need to do a
remap_pfn_range() at mmap() time, as DAX supplies fault/huge_fault handlers. Also, using
remap_pfn_range() means it's limited to a single contiguous PFN chunk.

KVM has the bits to make it work without struct pages; I don't think there's a need for
new pg/pfn_t/VM_* bits (aside from relying on {PFN,PAGE}_SPECIAL), as mentioned at the
start of the thread. I'm storing my WIP here:

	https://github.com/jpemartins/linux pageless-dax

Which is based on the first series that had been submitted earlier this year:

	https://lore.kernel.org/kvm/20200110190313.17144-1-joao.m.martins@oracle.com/
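
To illustrate the 'KVM has the bits' point, a simplified sketch of
resolving a user address in a VM_PFNMAP vma straight to a PFN with no
struct page involved, in the spirit of KVM's hva_to_pfn_remapped()
(helper signatures vary a bit across kernel versions; the caller is
expected to hold the mmap lock):

#include <linux/mm.h>
#include <linux/sched.h>

static int sketch_hva_to_pfn(struct vm_area_struct *vma, unsigned long addr,
			     unsigned long *pfn)
{
	int r = follow_pfn(vma, addr, pfn);

	if (r) {
		/* Nothing mapped at addr yet: fault it in, then retry. */
		r = fixup_user_fault(current->mm, addr, FAULT_FLAG_WRITE, NULL);
		if (r)
			return r;
		r = follow_pfn(vma, addr, pfn);
	}
	return r;
}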

  Joao
yulei zhang Oct. 20, 2020, 3:22 p.m. UTC | #11
On Tue, Oct 20, 2020 at 3:03 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> [...]

Just as Joao mentioned, remap_pfn_range() maps only a single
contiguous PFN range, which is not our intention. As for VM_DMEM, I
think we may drop it in the next version and use the existing bits as
much as possible to minimize the modifications.