Message ID: cover.1602093760.git.yuleixzhang@tencent.com (mailing list archive)
Series: Enhance memory utilization with DMEMFS
[adding a couple folks that directly or indirectly work on the subject]

On 10/8/20 8:53 AM, yulei.kernel@gmail.com wrote:
> From: Yulei Zhang <yuleixzhang@tencent.com>
>
> In current systems each physical memory page is associated with a page
> structure which is used to track the usage of this page. But due to
> memory usage growing rapidly in cloud environments, we find the
> resources consumed for page structure storage have become quite
> remarkable. So is it an expense that we could spare?
>
Happy to see another person working to solve the same problem!

I am really glad to see more folks being interested in solving
this problem and I hope we can join efforts?

BTW, there is also a second benefit in removing struct page -
which is carving out memory from the direct map.

> This patchset introduces an idea about how to save the extra
> memory through a new virtual filesystem -- dmemfs.
>
> Dmemfs (Direct Memory filesystem) is a device-memory or reserved-memory
> based filesystem. This kind of memory is special as it is not managed
> by the kernel and, most importantly, it is without 'struct page'.
> Therefore we can leverage the extra memory from the host system
> to support more tenants in our cloud service.
>
This is like a walk down memory lane.

About a year ago we followed the exact same idea/motivation to
have memory outside of the direct map (and remove the struct page
overhead) and started with our own layer/thingie. However, we realized
that DAX is one of the subsystems which already gives you direct access
to memory for free (and is already upstream), plus a couple of things
which we found more handy.

So we sent an RFC a couple months ago:

https://lore.kernel.org/linux-mm/20200110190313.17144-1-joao.m.martins@oracle.com/

Since then the majority of the work has been in improving DAX[1].
Now that that is done, I am going to follow up with the above patchset.

[1] https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.stgit@dwillia2-desk3.amr.corp.intel.com/

(Give me a couple of days and I will send you the link to the latest
patches on a git tree - would love feedback!)

The struct page removal for DAX would then be small, and ticks the
same bells and whistles (MCE handling, reserving PAT memtypes, ptrace
support) that we both do, with a smaller diffstat, and it doesn't
touch KVM (at least not fundamentally).

 15 files changed, 401 insertions(+), 38 deletions(-)

The thing needed in core-mm is handling PMD/PUD PAGE_SPECIAL, much like
we both do. Furthermore, there wouldn't be a need for a new vm type, for
consuming an extra page bit (in addition to PAGE_SPECIAL), or for a new
filesystem.

> We use a kernel boot parameter 'dmem=' to reserve system
> memory when the host system boots up; the details can be checked
> in /Documentation/admin-guide/kernel-parameters.txt.
>
> Theoretically for each 4k physical page we can save 64 bytes if
> we drop the 'struct page', so for a guest memory of 320G it can
> save about 5G of physical memory in total.
>
Also worth mentioning that if you only care about the 'struct page'
cost, and not about the security boundary, there is also some work on
hugetlbfs preallocation of hugepages that tricks vmemmap into reusing
tail pages:

https://lore.kernel.org/linux-mm/20200915125947.26204-1-songmuchun@bytedance.com/

Going forward that could also make sense for device-dax, to avoid so
many struct pages being allocated (which would require its transition to
compound struct pages like hugetlbfs, which we are looking at too). In
addition, an idea <handwaving> would perhaps be to have a stricter mode
in DAX where we initialize/use the metadata ('struct page') but remove
the underlying PFNs (of the 'struct page') from the direct map, bearing
the cost of mapping/unmapping on gup/pup.

	Joao
Joao, thanks a lot for the feedback. One more thing that needs
mentioning is that dmemfs also supports fine-grained memory management,
which makes it more flexible for tenants with different requirements.

On Fri, Oct 9, 2020 at 3:01 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> [adding a couple folks that directly or indirectly work on the subject]
>
> [...]
On 10/9/20 12:39 PM, yulei zhang wrote:
> Joao, thanks a lot for the feedback. One more thing needs to mention
> is that dmemfs also support fine-grained memory management which makes
> it more flexible for tenants with different requirements.
>
So does DAX, once it allows partitioning a region (starting with 5.10).
Meaning you have a region which you dedicate to userspace. That region
can then be partitioned into devices which give you access to multiple
(possibly discontiguous) extents at a given page granularity (selectable
when you create the device), accessed through mmap(). You can then give
that device to a cgroup. Or you can return that memory back to the
kernel (should you run into an OOM situation), or recreate the same
mappings across reboot/kexec.

I probably need to read your patches again, but can you expand on
"dmemfs also supports fine-grained memory management" so I understand
what gap you are referring to?

> On Fri, Oct 9, 2020 at 3:01 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>> [...]
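As a concrete illustration of the device-dax partitioning flow described
above, a hedged sketch using the daxctl utility from the ndctl project
(region and device names here are examples, and flag spellings may
differ across daxctl versions):

```shell
# List available device-dax regions (region0 is an example name).
daxctl list -R

# Carve a 4 GiB device out of a dynamic region, selecting a 2 MiB page
# granularity at creation time.
daxctl create-device -r region0 -s 4G -a 2M

# The resulting /dev/daxX.Y can be mmap()ed directly by a tenant, or
# handed back to the kernel as system RAM (e.g. under OOM pressure).
daxctl reconfigure-device --mode=system-ram dax0.1

# Tear the device down and return its capacity to the region.
daxctl destroy-device dax0.1
```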
On Fri, Oct 9, 2020 at 7:53 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> [...]
>
> I probably need to read your patches again, but can you extend on the
> 'dmemfs also support fine-grained memory management' to understand what
> is the gap that you mention?
>
Sure. dmemfs uses a bitmap to track the memory usage of the reserved
memory region at a given page-size granularity. And for each user the
memory can be discrete as well.

> [...]
On 10/10/20 9:15 AM, yulei zhang wrote:
> On Fri, Oct 9, 2020 at 7:53 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>> [...]
>>
> sure, dmemfs uses bitmap to track the memory usage in the reserved
> memory region in a given page size granularity. And for each user the
> memory can be discrete as well.
>
That same functionality - tracking reserved-region usage across
different users at any page granularity - is covered by the DAX series I
mentioned earlier in the thread. The discrete part - IIUC what you
meant - then reduces to using the DAX ABI/tools to create a device file
vs. a filesystem.

> [...]
> -----Original Message-----
> From: yulei.kernel@gmail.com [mailto:yulei.kernel@gmail.com]
> Sent: Thursday, October 08, 2020 3:54 PM
> To: akpm@linux-foundation.org; naoya.horiguchi@nec.com;
> viro@zeniv.linux.org.uk; pbonzini@redhat.com
> Cc: linux-fsdevel@vger.kernel.org; kvm@vger.kernel.org;
> linux-kernel@vger.kernel.org; xiaoguangrong.eric@gmail.com;
> kernellwp@gmail.com; lihaiwei.kernel@gmail.com; Yulei Zhang
> Subject: [PATCH 00/35] Enhance memory utilization with DMEMFS
>
> [...]
>
> Theoretically for each 4k physical page it can save 64 bytes if
> we drop the 'struct page', so for guest memory with 320G it can
> save about 5G physical memory totally.

Sounds interesting, but it seems your patch only supports x86; have you
considered aarch64?

Regards
Zengtao
On Mon, Oct 12, 2020 at 7:57 PM Zengtao (B) <prime.zeng@hisilicon.com> wrote:
>
> [...]
>
> Sounds interesting, but seems your patch only support x86, have you
> considered aarch64?
>
> Regards
> Zengtao

Thanks. So far we have only verified it on x86 servers; we may extend it
to the arm platform in the future.
On Mon, Oct 12, 2020 at 4:00 AM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
> That same functionality of tracking reserved region usage across
> different users at any page granularity is covered by the DAX series I
> mentioned below. The discrete part -- IIUC what you meant -- is then
> reduced to using the DAX ABI/tools to create a device file vs a
> filesystem.

Put another way, Linux already has a fine-grained memory management
system: the page allocator. Now, with recent device-dax extensions, it
also has a coarse-grained memory management system for physical
address-space partitioning and a path for struct-page-less backing for
VMs. What feature gaps remain vs. dmemfs, and can those gaps be closed
with incremental improvements to the two existing memory-management
systems?
On 15/10/20 00:25, Dan Williams wrote:
> Now, with recent device-dax extensions, it also has a coarse grained
> memory management system for physical address-space partitioning and a
> path for struct-page-less backing for VMs. What feature gaps remain vs
> dmemfs, and can those gaps be closed with incremental improvements to
> the 2 existing memory-management systems?

If I understand correctly, devm_memremap_pages() on ZONE_DEVICE memory
would still create the "struct page", albeit lazily? KVM would then use
the usual get_user_pages() path.

Looking more closely at the implementation of dmemfs, what I don't
understand is why dmemfs needs VM_DMEM etc. and cannot provide access to
mmap-ed memory using remap_pfn_range and VM_PFNMAP, just like /dev/mem.
If it did that, KVM would get physical addresses using fixup_user_fault
and would never need pfn_to_page() or get_user_pages(). I'm not saying
that would instantly be an approval, but it would remove a lot of hooks.

Paolo
On 10/19/20 2:37 PM, Paolo Bonzini wrote:
> If I understand correctly, devm_memremap_pages() on ZONE_DEVICE memory
> would still create the "struct page" albeit lazily? KVM then would use
> the usual get_user_pages() path.
>
Correct.

The removal of struct page would be one of the added incremental
improvements, like a 'map' with 'raw' sysfs attribute for dynamic dax
regions that wouldn't online/create the struct pages. The remaining
plumbing (...)

> Looking more closely at the implementation of dmemfs, I don't understand
> why dmemfs needs VM_DMEM etc. and cannot provide access to mmap-ed
> memory using remap_pfn_range and VM_PFNMAP, just like /dev/mem. If it
> did that KVM would get physical addresses using fixup_user_fault and
> never need pfn_to_page() or get_user_pages(). I'm not saying that would
> instantly be an approval, but it would remove a lot of hooks.
>
(...) is similar to what you describe above. Albeit there's probably no
need to do a remap_pfn_range at mmap(), as DAX supplies a
fault/huge_fault handler. Also, using remap_pfn_range means it's limited
to a single contiguous PFN chunk.

KVM has the bits to make it work without struct pages; I don't think
there's a need for new pg/pfn_t/VM_* bits (aside from relying on
{PFN,PAGE}_SPECIAL), as mentioned at the start of the thread.

I'm storing my WIP here:

https://github.com/jpemartins/linux pageless-dax

which is based on the first series that was submitted earlier this year:

https://lore.kernel.org/kvm/20200110190313.17144-1-joao.m.martins@oracle.com/

	Joao
On Tue, Oct 20, 2020 at 3:03 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> On 10/19/20 2:37 PM, Paolo Bonzini wrote:
>> [...]
>> Looking more closely at the implementation of dmemfs, I don't understand
>> why dmemfs needs VM_DMEM etc. and cannot provide access to mmap-ed
>> memory using remap_pfn_range and VM_PFNMAP, just like /dev/mem.  If it
>> did that, KVM would get physical addresses using fixup_user_fault and
>> never need pfn_to_page() or get_user_pages().  I'm not saying that would
>> instantly be an approval, but it would remove a lot of hooks.
>
> (...) is similar to what you describe above.  Albeit there's probably no
> need to do a remap_pfn_range at mmap(), as DAX supplies a
> fault/huge_fault handler.  Also, using remap_pfn_range() means the
> mapping is limited to a single contiguous PFN chunk.
>
> KVM has the bits to make it work without struct pages; I don't think
> there's a need for new pg/pfn_t/VM_* bits (aside from relying on
> {PFN,PAGE}_SPECIAL), as mentioned at the start of the thread.
>
> I'm storing my wip here:
>
>   https://github.com/jpemartins/linux pageless-dax
>
> Which is based on the first series that had been submitted earlier this
> year:
>
>   https://lore.kernel.org/kvm/20200110190313.17144-1-joao.m.martins@oracle.com/
>
> Joao

Just as Joao mentioned, remap_pfn_range() maps a single contiguous PFN
range, which is not our intention.  As for VM_DMEM, I think we may drop
it in the next version and reuse the existing bits as much as possible
to minimize the modifications.
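[For comparison, a fault-time scheme in the spirit of what is described
above would install one PFN per fault, so the backing memory need not be
one contiguous chunk.  Again a hypothetical sketch, not code from either
series; dmem_lookup_pfn() is an invented helper resolving a file offset
to a physical PFN:]

```c
/*
 * Hypothetical fault handler for a VM_PFNMAP vma: resolve the file
 * offset to a PFN at fault time and install the pte with
 * vmf_insert_pfn(), which never touches struct page.  Because each
 * fault is resolved independently, the file's backing can be
 * scattered across multiple dmem regions.
 */
static vm_fault_t dmemfs_fault(struct vm_fault *vmf)
{
	unsigned long pfn;

	/* dmem_lookup_pfn() is an invented helper: (file, pgoff) -> PFN */
	pfn = dmem_lookup_pfn(vmf->vma->vm_file, vmf->pgoff);
	if (pfn == (unsigned long)-1)	/* hypothetical "not backed" value */
		return VM_FAULT_SIGBUS;

	return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
}
```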
From: Yulei Zhang <yuleixzhang@tencent.com>

In the current system, each physical memory page is associated with a
page structure that is used to track the usage of the page.  But with
memory usage growing rapidly in cloud environments, we find that the
memory consumed by page structure storage becomes significant.  So is
it an expense that we could spare?

This patchset introduces an idea for saving that extra memory through a
new virtual filesystem -- dmemfs.

Dmemfs (Direct Memory filesystem) is a filesystem backed by device
memory or reserved memory.  This kind of memory is special as it is not
managed by the kernel and, most importantly, it has no 'struct page'.
Therefore we can leverage the extra memory on the host system to
support more tenants in our cloud service.

We use a kernel boot parameter, 'dmem=', to reserve system memory when
the host system boots up; the details can be found in
Documentation/admin-guide/kernel-parameters.txt.

Theoretically, for each 4k physical page we can save 64 bytes by
dropping the 'struct page', so for 320G of guest memory we can save
about 5G of physical memory in total.

Detailed usage of dmemfs is described in
Documentation/filesystems/dmemfs.rst.

TODO:
1. We temporarily disable record_steal_time() before entering the
   guest, and will re-enable it after solving the conflict.
2. Working on system calls such as mincore; we will update the status
   and patches soon.
Yulei Zhang (35):
  fs: introduce dmemfs module
  mm: support direct memory reservation
  dmem: implement dmem memory management
  dmem: let pat recognize dmem
  dmemfs: support mmap
  dmemfs: support truncating inode down
  dmem: trace core functions
  dmem: show some statistic in debugfs
  dmemfs: support remote access
  dmemfs: introduce max_alloc_try_dpages parameter
  mm: export mempolicy interfaces to serve dmem allocator
  dmem: introduce mempolicy support
  mm, dmem: introduce PFN_DMEM and pfn_t_dmem
  mm, dmem: dmem-pmd vs thp-pmd
  mm: add pmd_special() check for pmd_trans_huge_lock()
  dmemfs: introduce ->split() to dmemfs_vm_ops
  mm, dmemfs: support unmap_page_range() for dmemfs pmd
  mm: follow_pmd_mask() for dmem huge pmd
  mm: gup_huge_pmd() for dmem huge pmd
  mm: support dmem huge pmd for vmf_insert_pfn_pmd()
  mm: support dmem huge pmd for follow_pfn()
  kvm, x86: Distinguish dmemfs page from mmio page
  kvm, x86: introduce VM_DMEM
  dmemfs: support hugepage for dmemfs
  mm, x86, dmem: fix estimation of reserved page for vaddr_get_pfn()
  mm, dmem: introduce pud_special()
  mm: add pud_special() to support dmem huge pud
  mm, dmemfs: support huge_fault() for dmemfs
  mm: add follow_pte_pud()
  dmem: introduce dmem_bitmap_alloc() and dmem_bitmap_free()
  dmem: introduce mce handler
  mm, dmemfs: register and handle the dmem mce
  kvm, x86: temporary disable record_steal_time for dmem
  dmem: add dmem unit tests
  Add documentation for dmemfs

 .../admin-guide/kernel-parameters.txt |   38 +
 Documentation/filesystems/dmemfs.rst  |   59 +
 arch/x86/Kconfig                      |    1 +
 arch/x86/include/asm/pgtable.h        |   32 +-
 arch/x86/include/asm/pgtable_types.h  |   13 +-
 arch/x86/kernel/setup.c               |    3 +
 arch/x86/kvm/mmu/mmu.c                |    5 +-
 arch/x86/kvm/x86.c                    |    2 +
 arch/x86/mm/pat/memtype.c             |   21 +
 drivers/vfio/vfio_iommu_type1.c       |    4 +
 fs/Kconfig                            |    1 +
 fs/Makefile                           |    1 +
 fs/dmemfs/Kconfig                     |   16 +
 fs/dmemfs/Makefile                    |    8 +
 fs/dmemfs/inode.c                     | 1063 ++++++++++++++++
 fs/dmemfs/trace.h                     |   54 +
 fs/inode.c                            |    6 +
 include/linux/dmem.h                  |   49 +
 include/linux/fs.h                    |    1 +
 include/linux/huge_mm.h               |    5 +-
 include/linux/mempolicy.h             |    3 +
 include/linux/mm.h                    |    9 +
 include/linux/pfn_t.h                 |   17 +-
 include/linux/pgtable.h               |   22 +
 include/trace/events/dmem.h           |   85 ++
 include/uapi/linux/magic.h            |    1 +
 mm/Kconfig                            |   21 +
 mm/Makefile                           |    1 +
 mm/dmem.c                             | 1075 +++++++++++++++++
 mm/dmem_reserve.c                     |  303 +++++
 mm/gup.c                              |   94 +-
 mm/huge_memory.c                      |   19 +-
 mm/memory-failure.c                   |   69 +-
 mm/memory.c                           |   74 +-
 mm/mempolicy.c                        |    4 +-
 mm/mprotect.c                         |    7 +-
 mm/mremap.c                           |    3 +
 tools/testing/dmem/Kbuild             |    1 +
 tools/testing/dmem/Makefile           |   10 +
 tools/testing/dmem/dmem-test.c        |  184 +++
 40 files changed, 3336 insertions(+), 48 deletions(-)
 create mode 100644 Documentation/filesystems/dmemfs.rst
 create mode 100644 fs/dmemfs/Kconfig
 create mode 100644 fs/dmemfs/Makefile
 create mode 100644 fs/dmemfs/inode.c
 create mode 100644 fs/dmemfs/trace.h
 create mode 100644 include/linux/dmem.h
 create mode 100644 include/trace/events/dmem.h
 create mode 100644 mm/dmem.c
 create mode 100644 mm/dmem_reserve.c
 create mode 100644 tools/testing/dmem/Kbuild
 create mode 100644 tools/testing/dmem/Makefile
 create mode 100644 tools/testing/dmem/dmem-test.c