
[RFC,v3,00/10] Add support for shared PTEs across processes

Message ID 20240903232241.43995-1-anthony.yznaga@oracle.com

Message

Anthony Yznaga Sept. 3, 2024, 11:22 p.m. UTC
[Note: I have taken on the mshare work from Khalid. This series
is a continuation of the series last sent out in 2022[1] and not
of the later ptshare series.]

Memory pages shared between processes require page table entries
(PTEs) for each process. Each of these PTEs consumes some memory,
and as long as the number of mappings being maintained is small
enough, the space consumed by page tables is not objectionable.
When very few memory pages are shared between processes, the number
of PTEs to maintain is mostly constrained by the number of pages of
memory on the system. As the number of shared pages and the number
of times pages are shared go up, the amount of memory consumed by
page tables starts to become significant. This issue does not apply
to threads: any number of threads can share the same pages inside a
process while sharing the same PTEs. Extending the same model to
pages shared across processes would eliminate this overhead.

Field deployments commonly see memory pages shared across thousands
of processes. On x86_64, each page requires a PTE that is 8 bytes
long, which is very small compared to the 4K page size. When 2000
processes map the same page in their address space, each one of them
requires 8 bytes for its PTE, and together that adds up to 16K of
memory just to hold the PTEs for one 4K page. On a database server
with a 300GB SGA, a system crash with an out-of-memory condition was
seen when 1500+ clients tried to share this SGA even though the
system had 512GB of memory. In the worst case scenario on this
server, all 1500 processes mapping every page of the SGA would have
required 878GB+ just for the PTEs (300GB / 4KB = ~78.6 million
pages, times 1500 PTEs of 8 bytes each). If these PTEs could be
shared, a substantial amount of memory could be saved.

This patch series implements a mechanism that allows userspace
processes to opt into sharing PTEs. It adds a new in-memory
filesystem - msharefs. A file created on msharefs represents a
shared region where all processes mapping that region will map
objects within it with shared PTEs. When the file is created,
a new host mm struct is created to hold the shared page tables
and vmas for objects later mapped into the shared region. This
host mm struct is associated with the file and not with a task.
When a process mmap's the shared region, a vm flag VM_SHARED_PT
is added to the vma. On page fault the vma is checked for the
presence of the VM_SHARED_PT flag. If found, the host mm is
searched for a vma that covers the fault address. Fault handling
then continues using that host vma which establishes PTEs in the
host mm. Fault handling in a shared region also links the shared
page table to the process page table if the shared page table
already exists.
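
A rough sketch of that fault-path dispatch is below. This is
illustrative only: msharefs_host_mm() is a made-up helper, and the
step that links the shared page table into the faulting process is
elided.

        static vm_fault_t mshare_fault(struct vm_fault *vmf, unsigned int flags)
        {
                /* Made-up helper: get the host mm from the mshare file. */
                struct mm_struct *host_mm = msharefs_host_mm(vmf->vma->vm_file);
                struct vm_area_struct *host_vma;
                vm_fault_t ret;

                mmap_read_lock(host_mm);
                host_vma = find_vma(host_mm, vmf->address);
                if (!host_vma || host_vma->vm_start > vmf->address) {
                        mmap_read_unlock(host_mm);
                        return VM_FAULT_SIGSEGV;
                }
                /* Establish PTEs in the host mm using the host vma... */
                ret = handle_mm_fault(host_vma, vmf->address, flags, NULL);
                mmap_read_unlock(host_mm);
                /* ...then link the shared page table into the faulting mm. */
                return ret;
        }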

Ioctls are used to set/get the start address and size of the host
mm, to map objects into the shared region, and to (eventually)
perform other operations on the shared objects such as changing
protections.
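
For reference, the uapi implied by the examples below looks roughly
like the following sketch. The authoritative definitions live in
include/uapi/linux/msharefs.h from this series; the ioctl type and
numbers here are assumptions.

        struct mshare_info {
                __u64 start;
                __u64 size;
        };

        struct mshare_create {
                __u64 addr;
                __u64 size;
                __u64 offset;
                __u32 prot;
                __u32 flags;
                __s32 fd;
        };

        /* Illustrative ioctl numbers only. */
        #define MSHAREFS_SET_SIZE       _IOW('m', 1, struct mshare_info)
        #define MSHAREFS_GET_SIZE       _IOR('m', 2, struct mshare_info)
        #define MSHAREFS_CREATE_MAPPING _IOW('m', 3, struct mshare_create)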

One major issue to address for this series to function correctly
is how to ensure proper TLB flushing when a page in a shared
region is unmapped. For example, since the rmaps for pages in a
shared region map back to host vmas which point to a host mm, TLB
flushes won't be directed to the CPUs the sharing processes have
run on. I am by no means an expert in this area. One idea is to
install an mmu_notifier on the host mm that can gather the necessary
data and perform flushes similar to the existing batched TLB flushing.
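
One possible shape for the simplest variant of that idea, a notifier
that just flushes everything, is sketched below (an illustrative
sketch only; mshare_data is a made-up container struct):

        static void mshare_invalidate_tlbs(struct mmu_notifier *mn,
                                           struct mm_struct *mm,
                                           unsigned long start,
                                           unsigned long end)
        {
                /* Brute force: flush all TLBs on all CPUs. */
                flush_tlb_all();
        }

        static const struct mmu_notifier_ops mshare_mmu_ops = {
                .arch_invalidate_secondary_tlbs = mshare_invalidate_tlbs,
        };

        /* Register against the host mm when the mshare file is set up. */
        static int mshare_register_notifier(struct mshare_data *m_data)
        {
                m_data->mn.ops = &mshare_mmu_ops;
                return mmu_notifier_register(&m_data->mn, m_data->mm);
        }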


API
===

mshare does not introduce a new API. It instead uses existing APIs
to implement page table sharing. The steps to use this feature are:

1. Mount msharefs on /sys/fs/mshare -
        mount -t msharefs msharefs /sys/fs/mshare

2. mshare regions have alignment and size requirements. The start
   address for a region must be aligned to a fixed boundary, and the
   region size must be a multiple of that same value. This value can
   be obtained by reading the file /sys/fs/mshare/mshare_info, which
   returns a number in text format.

3. For the process creating an mshare region:
        a. Create a file on /sys/fs/mshare, for example -
                fd = open("/sys/fs/mshare/shareme",
                                O_RDWR|O_CREAT|O_EXCL, 0600);

        b. Establish the starting address and size of the region
                struct mshare_info minfo;

                minfo.start = TB(2);
                minfo.size = BUFFER_SIZE;
                ioctl(fd, MSHAREFS_SET_SIZE, &minfo)

        c. Map some memory in the region
                struct mshare_create mcreate;

                mcreate.addr = TB(2);
                mcreate.size = BUFFER_SIZE;
                mcreate.offset = 0;
                mcreate.prot = PROT_READ | PROT_WRITE;
                mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
                mcreate.fd = -1;

                ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate)

        d. Map the mshare region into the process
                mmap((void *)TB(2), BUFFER_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

        e. Read from and write to the mshare region normally.

4. For processes attaching an mshare region:
        a. Open the file on msharefs, for example -
                fd = open("/sys/fs/mshare/shareme", O_RDWR);

        b. Get information about the mshare'd region from the file:
                struct mshare_info minfo;

                ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
        
        c. Map the mshare'd region into the process
                mmap(minfo.start, minfo.size,
                        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

5. To delete the mshare region -
                unlink("/sys/fs/mshare/shareme");



Example Code
============

A snippet of the code that a donor process would run looks like this:

-----------------
        /*
         * Assumes <fcntl.h>, <stdio.h>, <stdlib.h>, <string.h>,
         * <unistd.h>, <sys/ioctl.h>, <sys/mman.h>, and the msharefs
         * uapi header from this series for the MSHAREFS_* ioctls.
         */
        struct mshare_info minfo;
        struct mshare_create mcreate;
        unsigned long start, size, alignsize;
        char req[128] = { 0 };
        char *addr;
        int fd, ret;

        fd = open("/sys/fs/mshare/mshare_info", O_RDONLY);
        read(fd, req, sizeof(req) - 1);
        alignsize = atol(req);  /* the value may not fit in an int */
        close(fd);
        fd = open("/sys/fs/mshare/shareme", O_RDWR|O_CREAT|O_EXCL, 0600);
        start = alignsize * 4;
        size = alignsize * 2;

        minfo.start = start;
        minfo.size = size;
        ret = ioctl(fd, MSHAREFS_SET_SIZE, &minfo);
        if (ret < 0)
                perror("ERROR: MSHAREFS_SET_SIZE");

        mcreate.addr = start;
        mcreate.size = size;
        mcreate.offset = 0;
        mcreate.prot = PROT_READ | PROT_WRITE;
        mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
        mcreate.fd = -1;
        ret = ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
        if (ret < 0)
                perror("ERROR: MSHAREFS_CREATE_MAPPING");

        addr = mmap((void *)start, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                perror("ERROR: mmap failed");
        else
                strncpy(addr, "Some random shared text",
                                sizeof("Some random shared text"));
-----------------


A snippet of the code that a consumer process would execute looks like this:

-----------------
        struct mshare_info minfo;
        char *addr;
        int fd, ret;

        /* O_RDWR is needed for a writable MAP_SHARED mapping. */
        fd = open("/sys/fs/mshare/shareme", O_RDWR);

        ret = ioctl(fd, MSHAREFS_GET_SIZE, &minfo);
        if (ret < 0)
                perror("ERROR: MSHAREFS_GET_SIZE");
        if (!minfo.start)
                fprintf(stderr, "ERROR: mshare region not init'd\n");

        addr = mmap((void *)minfo.start, minfo.size,
                        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) {
                perror("ERROR: mmap failed");
        } else {
                printf("Guest mmap at %p:\n", (void *)addr);
                printf("%s\n", addr);
        }

        printf("\nDone\n");

-----------------


v2 -> v3:
        - Now based on 6.11-rc5
        - Addressed many comments from v2.
        - Simplified filesystem code. Removed refcounting of the
          shared mm_struct allocated for an mshare file. The mm_struct
          and the pagetables and mappings it contains are freed when
          the inode is evicted.
        - Switched to an ioctl-based interface. Ioctls implemented
          are used to set and get the start address and size of an
          mshare region and to map objects into an mshare region
          (only anon shared memory is supported in this series).
        - Updated example code

[1] v2: https://lore.kernel.org/linux-mm/cover.1656531090.git.khalid.aziz@oracle.com/


v1 -> v2:
        - Eliminated mshare and mshare_unlink system calls and
          replaced API with standard mmap and unlink (Based upon
          v1 patch discussions and LSF/MM discussions)
        - All fd based API (based upon feedback and suggestions from
          Andy Lutomirski, Eric Biederman, Kirill and others)
        - Added a file /sys/fs/mshare/mshare_info to provide
          alignment and size requirement info (based upon feedback
          from Dave Hansen, Mark Hemment and discussions at LSF/MM)
        - Addressed TODOs in v1
        - Added support for directories in msharefs
        - Added locks around any time vma is touched (Dave Hansen)
        - Eliminated the need to point vm_mm in original vmas to the
          newly synthesized mshare mm
        - Ensured mmap_read_unlock is called for correct mm in
          handle_mm_fault (Dave Hansen)


Anthony Yznaga (3):
  mm/mshare: allocate an mm_struct for msharefs files
  mm: create __do_mmap() to take an mm_struct * arg
  mshare: add MSHAREFS_CREATE_MAPPING

Khalid Aziz (7):
  mm: Add msharefs filesystem
  mm/mshare: pre-populate msharefs with information file
  mm/mshare: make msharefs writable and support directories
  mm/mshare: Add ioctl support
  mm/mshare: Add vm flag for shared PTEs
  mm/mshare: Add mmap support
  mm/mshare: Add basic page table sharing support

 Documentation/filesystems/msharefs.rst        | 107 ++++
 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 include/linux/mm.h                            |  25 +-
 include/trace/events/mmflags.h                |   3 +
 include/uapi/linux/magic.h                    |   1 +
 include/uapi/linux/msharefs.h                 |  38 ++
 mm/Makefile                                   |   2 +-
 mm/internal.h                                 |   6 +
 mm/memory.c                                   |  62 +-
 mm/mmap.c                                     |  12 +-
 mm/mshare.c                                   | 582 ++++++++++++++++++
 11 files changed, 827 insertions(+), 12 deletions(-)
 create mode 100644 Documentation/filesystems/msharefs.rst
 create mode 100644 include/uapi/linux/msharefs.h
 create mode 100644 mm/mshare.c

Comments

Dave Hansen Oct. 2, 2024, 5:35 p.m. UTC | #1
We were just chatting about this on David Rientjes's MM alignment call.
I thought I'd try to give a little brain dump.

Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
tables get entries that effectively mirror the primary mm page tables
and constitute a secondary MMU.  If the primary page tables change,
mmu_notifiers ensure that the changes get reflected into the
virtualization tables and also that the virtualization paging structure
caches are flushed.

msharefs is doing something very similar.  But, in the msharefs case,
the secondary MMUs are actually normal CPU MMUs.  The page tables are
normal old page tables and the caches are the normal old TLB.  That's
what makes it so confusing: we have lots of infrastructure for dealing
with that "stuff" (CPU page tables and TLB), but msharefs has
short-circuited the infrastructure and it doesn't work any more.

Basically, I think it makes a lot of sense to check what KVM (or another
mmu_notifier user) is doing and make sure that msharefs is following its
lead.  For instance, KVM _should_ have the exact same "page free"
flushing issue where it gets the MMU notifier call but the page may
still be in the secondary MMU.  I _think_ KVM fixes it with an extra
page refcount that it takes when it first walks the primary page tables.

But the short of it is that the msharefs host mm represents a "secondary
MMU".  I don't think it is really that special of an MMU other than the
fact that it has an mm_struct.
Anthony Yznaga Oct. 2, 2024, 7:30 p.m. UTC | #2
On 10/2/24 10:35 AM, Dave Hansen wrote:
> We were just chatting about this on David Rientjes's MM alignment call.
> I thought I'd try to give a little brain dump.
>
> Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
> mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
> tables get entries that effectively mirror the primary mm page tables
> and constitute a secondary MMU.  If the primary page tables change,
> mmu_notifiers ensure that the changes get reflected into the
> virtualization tables and also that the virtualization paging structure
> caches are flushed.
>
> msharefs is doing something very similar.  But, in the msharefs case,
> the secondary MMUs are actually normal CPU MMUs.  The page tables are
> normal old page tables and the caches are the normal old TLB.  That's
> what makes it so confusing: we have lots of infrastructure for dealing
> with that "stuff" (CPU page tables and TLB), but msharefs has
> short-circuited the infrastructure and it doesn't work any more.
>
> Basically, I think it makes a lot of sense to check what KVM (or another
> mmu_notifier user) is doing and make sure that msharefs is following its
> lead.  For instance, KVM _should_ have the exact same "page free"
> flushing issue where it gets the MMU notifier call but the page may
> still be in the secondary MMU.  I _think_ KVM fixes it with an extra
> page refcount that it takes when it first walks the primary page tables.
>
> But the short of it is that the msharefs host mm represents a "secondary
> MMU".  I don't think it is really that special of an MMU other than the
> fact that it has an mm_struct.


Thanks, Dave. This is helpful. I'll look at what other mmu notifier 
users are doing. This does align with the comments in mmu_notifier.h 
regarding invalidate_range_start/end.


Anthony
Dave Hansen Oct. 2, 2024, 11:11 p.m. UTC | #3
About TLB flushing...

The quick and dirty thing to do is just flush_tlb_all() after you remove
the PTE from the host mm.  That will surely work everywhere and it's as
dirt simple as you get.  Honestly, it might even be cheaper than the
alternative.

Also, I don't think PCIDs actually complicate the problem at all.  We
basically do remote mm TLB flushes using two mechanisms:

	1. If the mm is loaded, use INVLPG and friends to zap the TLB
	2. Bump mm->context.tlb_gen so that the next time it _gets_
	   loaded, the TLB is flushed.

flush_tlb_func() really only cares about #1 since if the mm isn't
loaded, it'll get flushed anyway at the next context switch.

The alternatives I can think of:

Make flush_tlb_mm_range(host_mm) work somehow.  You'd need to somehow
keep mm_cpumask(host_mm) up to date and also do something to
flush_tlb_func() to tell it that 'loaded_mm' isn't relevant and it
should flush regardless.

The other way is to use the msharefs's inode ->i_mmap to find all the
VMAs mapping the file, and find all *their* mm's:

	for each vma in inode->i_mmap
		mm = vma->vm_mm
		flush_tlb_mm_range(<vma range here>)

But that might be even worse than flush_tlb_all() because it might end
up sending more than one IPI per CPU.

You can fix _that_ by keeping a single cpumask that you build up:

	mask = 0
	for each vma in inode->i_mmap
		mm = vma->vm_mm
		mask |= mm_cpumask(mm)

	flush_tlb_multi(mask, info);

Unfortunately, 'info->mm' needs to be more than one mm, so you probably
still need a new flush_tlb_func() flush type to tell it to ignore
'info->mm' and flush anyway.

After all that, I kinda like flush_tlb_all(). ;)
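
A rough C rendering of that cpumask idea, for illustration (the
helper name is invented, building the flush_tlb_info is omitted, and
flush_tlb_multi() is x86-specific):

        static void mshare_flush_sharers(struct address_space *mapping,
                                         struct flush_tlb_info *info)
        {
                struct vm_area_struct *vma;
                cpumask_var_t mask;

                if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
                        return; /* could fall back to flush_tlb_all() */

                i_mmap_lock_read(mapping);
                vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX)
                        cpumask_or(mask, mask, mm_cpumask(vma->vm_mm));
                i_mmap_unlock_read(mapping);

                flush_tlb_multi(mask, info);
                free_cpumask_var(mask);
        }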
Anthony Yznaga Oct. 3, 2024, 12:24 a.m. UTC | #4
On 10/2/24 4:11 PM, Dave Hansen wrote:
> About TLB flushing...
>
> The quick and dirty thing to do is just flush_tlb_all() after you remove
> the PTE from the host mm.  That will surely work everywhere and it's as
> dirt simple as you get.  Honestly, it might even be cheaper than the
> alternative.

I think this is a good place to start from. My concern is that unnecessary 
flushes will potentially impact unrelated loads. Performance testing as 
things progress can help determine if a more involved approach is needed.

>
> Also, I don't think PCIDs actually complicate the problem at all.  We
> basically do remote mm TLB flushes using two mechanisms:
>
> 	1. If the mm is loaded, use INVLPG and friends to zap the TLB
> 	2. Bump mm->context.tlb_gen so that the next time it _gets_
> 	   loaded, the TLB is flushed.
>
> flush_tlb_func() really only cares about #1 since if the mm isn't
> loaded, it'll get flushed anyway at the next context switch.
>
> The alternatives I can think of:
>
> Make flush_tlb_mm_range(host_mm) work somehow.  You'd need to somehow
> keep mm_cpumask(host_mm) up to date and also do something to
> flush_tlb_func() to tell it that 'loaded_mm' isn't relevant and it
> should flush regardless.
>
> The other way is to use the msharefs's inode ->i_mmap to find all the
> VMAs mapping the file, and find all *their* mm's:
>
> 	for each vma in inode->i_mmap
> 		mm = vma->vm_mm
> 		flush_tlb_mm_range(<vma range here>)
>
> But that might be even worse than flush_tlb_all() because it might end
> up sending more than one IPI per CPU.
>
> You can fix _that_ by keeping a single cpumask that you build up:
>
> 	mask = 0
> 	for each vma in inode->i_mmap
> 		mm = vma->vm_mm
> 		mask |= mm_cpumask(mm)
>
> 	flush_tlb_multi(mask, info);
>
> Unfortunately, 'info->mm' needs to be more than one mm, so you probably
> still need a new flush_tlb_func() flush type to tell it to ignore
> 'info->mm' and flush anyway.

What about something like arch_tlbbatch_flush() which sets info->mm to 
NULL and uses a batch cpumask? That seems perfect though there would be 
a bit more work needed to ensure things work on other architectures. 
flush_tlb_all() it is then, for now. :-)

>
> After all that, I kinda like flush_tlb_all(). ;)
David Hildenbrand Oct. 7, 2024, 8:44 a.m. UTC | #5
On 02.10.24 19:35, Dave Hansen wrote:
> We were just chatting about this on David Rientjes's MM alignment call.

Unfortunately I was not able to attend this time, my body decided it's a 
good idea to stay in bed for a couple of days.

> I thought I'd try to give a little brain dump.
> 
> Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
> mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
> tables get entries that effectively mirror the primary mm page tables
> and constitute a secondary MMU.  If the primary page tables change,
> mmu_notifiers ensure that the changes get reflected into the
> virtualization tables and also that the virtualization paging structure
> caches are flushed.
> 
> msharefs is doing something very similar.  But, in the msharefs case,
> the secondary MMUs are actually normal CPU MMUs.  The page tables are
> normal old page tables and the caches are the normal old TLB.  That's
> what makes it so confusing: we have lots of infrastructure for dealing
> with that "stuff" (CPU page tables and TLB), but msharefs has
> short-circuited the infrastructure and it doesn't work any more.

It's quite different IMHO, to a degree that I believe they are different 
beasts:

Secondary MMUs:
* "Belongs" to same MM context and the primary MMU (process page tables)
* Maintains separate tables/PTEs, in completely separate page table
   hierarchy
* Notifiers make sure the secondary structure stays in sync (update
   PTEs, flush TLB)

mshare:
* Possibly mapped by many different MMs. Likely nothing stops us from
   having one MM map multiple different mshare fds.
* Updating the PTEs directly affects all other MM page table structures
   (and possibly any secondary MMUs! scary)


I better not think about the complexity of secondary MMUs + mshare (e.g., 
KVM with mshare in guest memory): MMU notifiers for all MMs must be 
called ...


> 
> Basically, I think it makes a lot of sense to check what KVM (or another
> mmu_notifier user) is doing and make sure that msharefs is following its
> lead.  For instance, KVM _should_ have the exact same "page free"
> flushing issue where it gets the MMU notifier call but the page may
> still be in the secondary MMU.  I _think_ KVM fixes it with an extra
> page refcount that it takes when it first walks the primary page tables.
> 
> But the short of it is that the msharefs host mm represents a "secondary
> MMU".  I don't think it is really that special of an MMU other than the
> fact that it has an mm_struct.

Not sure I agree ... IMHO these are two orthogonal things. Unless we 
want MMU notifiers to "update" MM primary MMUs (there is not really 
anything to update ...), but not sure if that is what we are looking for.

What you note about TLB flushing in the other mail makes sense, not sure 
how this interacts with any secondary MMUs ....
David Hildenbrand Oct. 7, 2024, 8:48 a.m. UTC | #6
On 02.10.24 19:35, Dave Hansen wrote:
> We were just chatting about this on David Rientjes's MM alignment call.
> I thought I'd try to give a little brain dump.
> 
> Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
> mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
> tables get entries that effectively mirror the primary mm page tables
> and constitute a secondary MMU.  If the primary page tables change,
> mmu_notifiers ensure that the changes get reflected into the
> virtualization tables and also that the virtualization paging structure
> caches are flushed.
> 
> msharefs is doing something very similar.  But, in the msharefs case,
> the secondary MMUs are actually normal CPU MMUs.  The page tables are
> normal old page tables and the caches are the normal old TLB.  That's
> what makes it so confusing: we have lots of infrastructure for dealing
> with that "stuff" (CPU page tables and TLB), but msharefs has
> short-circuited the infrastructure and it doesn't work any more.
> 
> Basically, I think it makes a lot of sense to check what KVM (or another
> mmu_notifier user) is doing and make sure that msharefs is following its
> lead.  For instance, KVM _should_ have the exact same "page free"
> flushing issue where it gets the MMU notifier call but the page may
> still be in the secondary MMU.  I _think_ KVM fixes it with an extra
> page refcount that it takes when it first walks the primary page tables.

Forgot to comment on this (brain still recovering ...).

KVM only grabs a temporary reference, and drops that reference once the 
secondary MMU PTE was updated (present PTE installed). The notifiers on 
primary MMU changes (e.g., unmap) take care of any TLB invalidation 
before the primary MMU lets go of the page and the refcount is dropped.
Kirill A. Shutemov Oct. 7, 2024, 9:01 a.m. UTC | #7
On Tue, Sep 03, 2024 at 04:22:31PM -0700, Anthony Yznaga wrote:
> This patch series implements a mechanism that allows userspace
> processes to opt into sharing PTEs. It adds a new in-memory
> filesystem - msharefs. A file created on msharefs represents a
> shared region where all processes mapping that region will map
> objects within it with shared PTEs. When the file is created,
> a new host mm struct is created to hold the shared page tables
> and vmas for objects later mapped into the shared region. This
> host mm struct is associated with the file and not with a task.

Taskless mm_struct can be problematic. Like, we don't have access to its
counters because it is not represented in /proc. For instance, there's no
way to check its smaps.

Also, I *think* it is immune to oom-killer because oom-killer looks for a
victim task, not mm.
I hope it is not an intended feature :P

> When a process mmap's the shared region, a vm flag VM_SHARED_PT
> is added to the vma. On page fault the vma is checked for the
> presence of the VM_SHARED_PT flag.

I think it is wrong approach.

Instead of spraying VM_SHARED_PT checks across core-mm, we need to add a
generic hooks that can be used by mshare and hugetlb. And remove
is_vm_hugetlb_page() check from core-mm along the way.

BTW, is_vm_hugetlb_page() callsites seem to be the indicator to check if
mshare has to do something differently there. I feel you miss a lot of
such cases.
Dave Hansen Oct. 7, 2024, 3:58 p.m. UTC | #8
On 10/7/24 01:44, David Hildenbrand wrote:
> On 02.10.24 19:35, Dave Hansen wrote:
>> We were just chatting about this on David Rientjes's MM alignment call.
> 
> Unfortunately I was not able to attend this time, my body decided it's a
> good idea to stay in bed for a couple of days.
> 
>> I thought I'd try to give a little brain dump.
>>
>> Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
>> mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
>> tables get entries that effectively mirror the primary mm page tables
>> and constitute a secondary MMU.  If the primary page tables change,
>> mmu_notifiers ensure that the changes get reflected into the
>> virtualization tables and also that the virtualization paging structure
>> caches are flushed.
>>
>> msharefs is doing something very similar.  But, in the msharefs case,
>> the secondary MMUs are actually normal CPU MMUs.  The page tables are
>> normal old page tables and the caches are the normal old TLB.  That's
>> what makes it so confusing: we have lots of infrastructure for dealing
>> with that "stuff" (CPU page tables and TLB), but msharefs has
>> short-circuited the infrastructure and it doesn't work any more.
> 
> It's quite different IMHO, to a degree that I believe they are different
> beasts:
> 
> Secondary MMUs:
> * "Belongs" to same MM context and the primary MMU (process page tables)

I think you're speaking to the ratio here.  For each secondary MMU, I
think you're saying that there's one and only one mm_struct.  Is that right?

> * Maintains separate tables/PTEs, in completely separate page table
>   hierarchy

This is the case for KVM and the VMX/SVM MMUs, but it's not generally
true about hardware.  IOMMUs can walk x86 page tables and populate the
IOTLB from the _same_ page table hierarchy as the CPU.
David Hildenbrand Oct. 7, 2024, 4:27 p.m. UTC | #9
On 07.10.24 17:58, Dave Hansen wrote:
> On 10/7/24 01:44, David Hildenbrand wrote:
>> On 02.10.24 19:35, Dave Hansen wrote:
>>> We were just chatting about this on David Rientjes's MM alignment call.
>>
>> Unfortunately I was not able to attend this time, my body decided it's a
>> good idea to stay in bed for a couple of days.
>>
>>> I thought I'd try to give a little brain dump.
>>>
>>> Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
>>> mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
>>> tables get entries that effectively mirror the primary mm page tables
>>> and constitute a secondary MMU.  If the primary page tables change,
>>> mmu_notifiers ensure that the changes get reflected into the
>>> virtualization tables and also that the virtualization paging structure
>>> caches are flushed.
>>>
>>> msharefs is doing something very similar.  But, in the msharefs case,
>>> the secondary MMUs are actually normal CPU MMUs.  The page tables are
>>> normal old page tables and the caches are the normal old TLB.  That's
>>> what makes it so confusing: we have lots of infrastructure for dealing
>>> with that "stuff" (CPU page tables and TLB), but msharefs has
>>> short-circuited the infrastructure and it doesn't work any more.
>>
>> It's quite different IMHO, to a degree that I believe they are different
>> beasts:
>>
>> Secondary MMUs:
>> * "Belongs" to same MM context and the primary MMU (process page tables)
> 
> I think you're speaking to the ratio here.  For each secondary MMU, I
> think you're saying that there's one and only one mm_struct.  Is that right?

Yes, that is my understanding (at least with KVM). It's a secondary MMU 
derived from exactly one primary MMU (MM context -> page table hierarchy).

The sophisticated ( :) ) notifier mechanism when updating the primary 
MMU will result in keeping the secondary MMU in sync (of course, what to 
sync, and how to, depends in KVM on the memory slots that define how the 
guest physical memory layout is derived from the process virtual address 
space).

> 
>> * Maintains separate tables/PTEs, in completely separate page table
>>    hierarchy
> 
> This is the case for KVM and the VMX/SVM MMUs, but it's not generally
> true about hardware.  IOMMUs can walk x86 page tables and populate the
> IOTLB from the _same_ page table hierarchy as the CPU.

Yes, of course.
Sean Christopherson Oct. 7, 2024, 4:45 p.m. UTC | #10
On Mon, Oct 07, 2024, David Hildenbrand wrote:
> On 07.10.24 17:58, Dave Hansen wrote:
> > On 10/7/24 01:44, David Hildenbrand wrote:
> > > On 02.10.24 19:35, Dave Hansen wrote:
> > > > We were just chatting about this on David Rientjes's MM alignment call.
> > > 
> > > Unfortunately I was not able to attend this time, my body decided it's a
> > > good idea to stay in bed for a couple of days.
> > > 
> > > > I thought I'd try to give a little brain dump.
> > > > 
> > > > Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
> > > > mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
> > > > tables get entries that effectively mirror the primary mm page tables
> > > > and constitute a secondary MMU.  If the primary page tables change,
> > > > mmu_notifiers ensure that the changes get reflected into the
> > > > virtualization tables and also that the virtualization paging structure
> > > > caches are flushed.
> > > > 
> > > > msharefs is doing something very similar.  But, in the msharefs case,
> > > > the secondary MMUs are actually normal CPU MMUs.  The page tables are
> > > > normal old page tables and the caches are the normal old TLB.  That's
> > > > what makes it so confusing: we have lots of infrastructure for dealing
> > > > with that "stuff" (CPU page tables and TLB), but msharefs has
> > > > short-circuited the infrastructure and it doesn't work any more.
> > > 
> > > It's quite different IMHO, to a degree that I believe they are different
> > > beasts:
> > > 
> > > Secondary MMUs:
> > > * "Belongs" to same MM context and the primary MMU (process page tables)
> > 
> > I think you're speaking to the ratio here.  For each secondary MMU, I
> > think you're saying that there's one and only one mm_struct.  Is that right?
> 
> Yes, that is my understanding (at least with KVM). It's a secondary MMU
> derived from exactly one primary MMU (MM context -> page table hierarchy).

I don't think the ratio is what's important.  I think the important takeaway is
that the secondary MMU is explicitly tied to the primary MMU that it is tracking.
This is enforced in code, as the list of mmu_notifiers is stored in mm_struct.

The 1:1 ratio probably holds true today, e.g. for KVM, each VM is associated with
exactly one mm_struct.  But fundamentally, nothing would prevent a secondary MMU
that manages a so called software TLB from tracking multiple primary MMUs.

E.g. it wouldn't be all that hard to implement in KVM (a bit crazy, but not hard),
because KVM's memslots disallow gfn aliases, i.e. each index into KVM's secondary
MMU would be associated with at most one VMA and thus mm_struct.

Pulling Dave's earlier comment in:

 : But the short of it is that the msharefs host mm represents a "secondary
 : MMU".  I don't think it is really that special of an MMU other than the
 : fact that it has an mm_struct.

and David's (so. many. Davids):

 : I better not think about the complexity of secondary MMUs + mshare (e.g.,
 : KVM with mshare in guest memory): MMU notifiers for all MMs must be
 : called ...

mshare() is unique because it creates the possibility of chained "secondary" MMUs.
I.e. the fact that it has an mm_struct makes it *very* special, IMO.

> > > * Maintains separate tables/PTEs, in completely separate page table
> > >    hierarchy
> > 
> > This is the case for KVM and the VMX/SVM MMUs, but it's not generally
> > true about hardware.  IOMMUs can walk x86 page tables and populate the
> > IOTLB from the _same_ page table hierarchy as the CPU.
> 
> Yes, of course.

Yeah, the recent rework of invalidate_range() => arch_invalidate_secondary_tlbs()
sums things up nicely:

commit 1af5a8109904b7f00828e7f9f63f5695b42f8215
Author:     Alistair Popple <apopple@nvidia.com>
AuthorDate: Tue Jul 25 23:42:07 2023 +1000
Commit:     Andrew Morton <akpm@linux-foundation.org>
CommitDate: Fri Aug 18 10:12:41 2023 -0700

    mmu_notifiers: rename invalidate_range notifier
    
    There are two main use cases for mmu notifiers.  One is by KVM which uses
    mmu_notifier_invalidate_range_start()/end() to manage a software TLB.
    
    The other is to manage hardware TLBs which need to use the
    invalidate_range() callback because HW can establish new TLB entries at
    any time.  Hence using start/end() can lead to memory corruption as these
    callbacks happen too soon/late during page unmap.
    
    mmu notifier users should therefore either use the start()/end() callbacks
    or the invalidate_range() callbacks.  To make this usage clearer rename
    the invalidate_range() callback to arch_invalidate_secondary_tlbs() and
    update documention.
Anthony Yznaga Oct. 7, 2024, 7:23 p.m. UTC | #11
On 10/7/24 2:01 AM, Kirill A. Shutemov wrote:
> On Tue, Sep 03, 2024 at 04:22:31PM -0700, Anthony Yznaga wrote:
>> This patch series implements a mechanism that allows userspace
>> processes to opt into sharing PTEs. It adds a new in-memory
>> filesystem - msharefs. A file created on msharefs represents a
>> shared region where all processes mapping that region will map
>> objects within it with shared PTEs. When the file is created,
>> a new host mm struct is created to hold the shared page tables
>> and vmas for objects later mapped into the shared region. This
>> host mm struct is associated with the file and not with a task.
> Taskless mm_struct can be problematic. Like, we don't have access to its
> counters because it is not represented in /proc. For instance, there's no
> way to check its smaps.

Definitely needs exposure in /proc. One of the things I'm looking into 
is the feasibility of showing the mappings in maps/smaps/etc..


>
> Also, I *think* it is immune to oom-killer because oom-killer looks for a
> victim task, not mm.
> I hope it is not an intended feature :P

oom-killer would have to kill all sharers of an mshare region before the 
mshare region itself could be freed, but I'm not sure that oom-killer 
would be the one to free the region. An mshare region is essentially a 
shared memory object not unlike a tmpfs or hugetlb file. I think some 
higher level intelligence would have to be involved to release it if 
appropriate when under oom conditions.


>
>> When a process mmap's the shared region, a vm flag VM_SHARED_PT
>> is added to the vma. On page fault the vma is checked for the
>> presence of the VM_SHARED_PT flag.
> I think it is wrong approach.
>
> Instead of spraying VM_SHARED_PT checks across core-mm, we need to add a
> generic hooks that can be used by mshare and hugetlb. And remove
> is_vm_hugetlb_page() check from core-mm along the way.
>
> BTW, is_vm_hugetlb_page() callsites seem to be the indicator to check if
> mshare has to do something differently there. I feel you miss a lot of
> such cases.

Good point about is_vm_hugetlb_page(). I'll review the callsites (there 
are only ~60 of them :-).


Thanks,

Anthony
David Hildenbrand Oct. 7, 2024, 7:41 p.m. UTC | #12
On 07.10.24 21:23, Anthony Yznaga wrote:
> 
> On 10/7/24 2:01 AM, Kirill A. Shutemov wrote:
>> On Tue, Sep 03, 2024 at 04:22:31PM -0700, Anthony Yznaga wrote:
>>> This patch series implements a mechanism that allows userspace
>>> processes to opt into sharing PTEs. It adds a new in-memory
>>> filesystem - msharefs. A file created on msharefs represents a
>>> shared region where all processes mapping that region will map
>>> objects within it with shared PTEs. When the file is created,
>>> a new host mm struct is created to hold the shared page tables
>>> and vmas for objects later mapped into the shared region. This
>>> host mm struct is associated with the file and not with a task.
>> Taskless mm_struct can be problematic. Like, we don't have access to its
>> counters because it is not represented in /proc. For instance, there's no
>> way to check its smaps.
> 
> Definitely needs exposure in /proc. One of the things I'm looking into
> is the feasibility of showing the mappings in maps/smaps/etc..
> 
> 
>>
>> Also, I *think* it is immune to oom-killer because oom-killer looks for a
>> victim task, not mm.
>> I hope it is not an intended feature :P
> 
> oom-killer would have to kill all sharers of an mshare region before the
> mshare region itself could be freed, but I'm not sure that oom-killer
> would be the one to free the region. An mshare region is essentially a
> shared memory object not unlike a tmpfs or hugetlb file. I think some
> higher level intelligence would have to be involved to release it if
> appropriate when under oom conditions.


I thought we discussed that at LSF/MM last year and the conclusion was 
that a process is needed (OOM kill, cgroup handling, ...), and it must 
remain running. Once it stops running, we can force-unmap all pages etc.

Did you look at the summary/(recording if available) of that, or am I 
remembering things wrongly or was there actually such a discussion?

I know, it's problematic that this feature switched owners, ...
Anthony Yznaga Oct. 7, 2024, 7:46 p.m. UTC | #13
On 10/7/24 12:41 PM, David Hildenbrand wrote:
> On 07.10.24 21:23, Anthony Yznaga wrote:
>>
>> On 10/7/24 2:01 AM, Kirill A. Shutemov wrote:
>>> On Tue, Sep 03, 2024 at 04:22:31PM -0700, Anthony Yznaga wrote:
>>>> This patch series implements a mechanism that allows userspace
>>>> processes to opt into sharing PTEs. It adds a new in-memory
>>>> filesystem - msharefs. A file created on msharefs represents a
>>>> shared region where all processes mapping that region will map
>>>> objects within it with shared PTEs. When the file is created,
>>>> a new host mm struct is created to hold the shared page tables
>>>> and vmas for objects later mapped into the shared region. This
>>>> host mm struct is associated with the file and not with a task.
>>> Taskless mm_struct can be problematic. Like, we don't have access to 
>>> its
>>> counters because it is not represented in /proc. For instance, 
>>> there's no
>>> way to check its smaps.
>>
>> Definitely needs exposure in /proc. One of the things I'm looking into
>> is the feasibility of showing the mappings in maps/smaps/etc..
>>
>>
>>>
>>> Also, I *think* it is immune to oom-killer because oom-killer looks 
>>> for a
>>> victim task, not mm.
>>> I hope it is not an intended feature :P
>>
>> oom-killer would have to kill all sharers of an mshare region before the
>> mshare region itself could be freed, but I'm not sure that oom-killer
>> would be the one to free the region. An mshare region is essentially a
>> shared memory object not unlike a tmpfs or hugetlb file. I think some
>> higher level intelligence would have to be involved to release it if
>> appropriate when under oom conditions.
>
>
> I thought we discussed that at LSF/MM last year and the conclusion was 
> that a process is needed (OOM kill, cgroup handling, ...), and it must 
> remain running. Once it stops running, we can force-unmap all pages etc.
>
> Did you look at the summary/(recording if available) of that, or am I 
> remembering things wrongly or was there actually such a discussion?

You're likely right. I'll review the discussion.


Anthony


>
> I know, it's problematic that this feature switched owners, ...
Anthony Yznaga Oct. 8, 2024, 1:37 a.m. UTC | #14
On 10/7/24 9:45 AM, Sean Christopherson wrote:
> On Mon, Oct 07, 2024, David Hildenbrand wrote:
>> On 07.10.24 17:58, Dave Hansen wrote:
>>> On 10/7/24 01:44, David Hildenbrand wrote:
>>>> On 02.10.24 19:35, Dave Hansen wrote:
>>>>> We were just chatting about this on David Rientjes's MM alignment call.
>>>> Unfortunately I was not able to attend this time, my body decided it's a
>>>> good idea to stay in bed for a couple of days.
>>>>
>>>>> I thought I'd try to give a little brain dump.
>>>>>
>>>>> Let's start by thinking about KVM and secondary MMUs.  KVM has a primary
>>>>> mm: the QEMU (or whatever) process mm.  The virtualization (EPT/NPT)
>>>>> tables get entries that effectively mirror the primary mm page tables
>>>>> and constitute a secondary MMU.  If the primary page tables change,
>>>>> mmu_notifiers ensure that the changes get reflected into the
>>>>> virtualization tables and also that the virtualization paging structure
>>>>> caches are flushed.
>>>>>
>>>>> msharefs is doing something very similar.  But, in the msharefs case,
>>>>> the secondary MMUs are actually normal CPU MMUs.  The page tables are
>>>>> normal old page tables and the caches are the normal old TLB.  That's
>>>>> what makes it so confusing: we have lots of infrastructure for dealing
>>>>> with that "stuff" (CPU page tables and TLB), but msharefs has
>>>>> short-circuited the infrastructure and it doesn't work any more.
>>>> It's quite different IMHO, to a degree that I believe they are different
>>>> beasts:
>>>>
>>>> Secondary MMUs:
>>>> * "Belongs" to same MM context and the primary MMU (process page tables)
>>> I think you're speaking to the ratio here.  For each secondary MMU, I
>>> think you're saying that there's one and only one mm_struct.  Is that right?
>> Yes, that is my understanding (at least with KVM). It's a secondary MMU
>> derived from exactly one primary MMU (MM context -> page table hierarchy).
> I don't think the ratio is what's important.  I think the important takeaway is
> that the secondary MMU is explicitly tied to the primary MMU that it is tracking.
> This is enforced in code, as the list of mmu_notifiers is stored in mm_struct.
>
> The 1:1 ratio probably holds true today, e.g. for KVM, each VM is associated with
> exactly one mm_struct.  But fundamentally, nothing would prevent a secondary MMU
> that manages a so called software TLB from tracking multiple primary MMUs.
>
> E.g. it wouldn't be all that hard to implement in KVM (a bit crazy, but not hard),
> because KVM's memslots disallow gfn aliases, i.e. each index into KVM's secondary
> MMU would be associated with at most one VMA and thus mm_struct.
>
> Pulling Dave's earlier comment in:
>
>   : But the short of it is that the msharefs host mm represents a "secondary
>   : MMU".  I don't think it is really that special of an MMU other than the
>   : fact that it has an mm_struct.
>
> and David's (so. many. Davids):
>
>   : I better not think about the complexity of secondary MMUs + mshare (e.g.,
>   : KVM with mshare in guest memory): MMU notifiers for all MMs must be
>   : called ...
>
> mshare() is unique because it creates the possibility of chained "secondary" MMUs.
> I.e. the fact that it has an mm_struct makes it *very* special, IMO.

This is definitely a gap with the current mshare implementation.
Mapping memory that relies on an mmu notifier in an mshare region
will result in the notifier callbacks never being called. On the
surface it seems like the mshare mm needs notifiers that go through
every mm that has mapped the mshare region and call their notifiers.


>
>>>> * Maintains separate tables/PTEs, in completely separate page table
>>>>     hierarchy
>>> This is the case for KVM and the VMX/SVM MMUs, but it's not generally
>>> true about hardware.  IOMMUs can walk x86 page tables and populate the
>>> IOTLB from the _same_ page table hierarchy as the CPU.
>> Yes, of course.
> Yeah, the recent rework of invalidate_range() => arch_invalidate_secondary_tlbs()
> sums things up nicely:
>
> commit 1af5a8109904b7f00828e7f9f63f5695b42f8215
> Author:     Alistair Popple <apopple@nvidia.com>
> AuthorDate: Tue Jul 25 23:42:07 2023 +1000
> Commit:     Andrew Morton <akpm@linux-foundation.org>
> CommitDate: Fri Aug 18 10:12:41 2023 -0700
>
>      mmu_notifiers: rename invalidate_range notifier
>      
>      There are two main use cases for mmu notifiers.  One is by KVM which uses
>      mmu_notifier_invalidate_range_start()/end() to manage a software TLB.
>      
>      The other is to manage hardware TLBs which need to use the
>      invalidate_range() callback because HW can establish new TLB entries at
>      any time.  Hence using start/end() can lead to memory corruption as these
>      callbacks happen too soon/late during page unmap.
>      
>      mmu notifier users should therefore either use the start()/end() callbacks
>      or the invalidate_range() callbacks.  To make this usage clearer rename
>      the invalidate_range() callback to arch_invalidate_secondary_tlbs() and
>      update documention.

I believe if I implemented an arch_invalidate_secondary_tlbs notifier
that flushed all TLBs, that would solve the problem of ensuring TLBs
are flushed before pages being unmapped from an mshare region are
freed. However, it seems like there would potentially be a lot more
collateral damage from flushing everything since the flushes would
happen more often.
Jann Horn Oct. 14, 2024, 8:07 p.m. UTC | #15
On Wed, Sep 4, 2024 at 1:22 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> One major issue to address for this series to function correctly
> is how to ensure proper TLB flushing when a page in a shared
> region is unmapped. For example, since the rmaps for pages in a
> shared region map back to host vmas which point to a host mm, TLB
> flushes won't be directed to the CPUs the sharing processes have
> run on. I am by no means an expert in this area. One idea is to
> install a mmu_notifier on the host mm that can gather the necessary
> data and do flushes similar to the batch flushing.

The mmu_notifier API has two ways you can use it:

First, there is the classic mode, where before you start modifying
PTEs in some range, you remove mirrored PTEs from some other context,
and until you're done with your PTE modification, you don't allow
creation of new mirrored PTEs. This is intended for cases where
individual PTE entries are copied over to some other context (such as
EPT tables for virtualization). When I last looked at that code, it
looked fine, and this is what KVM uses. But it probably doesn't match
your usecase, since you wouldn't want removal of a single page to
cause the entire page table containing it to be temporarily unmapped
from the processes that use it?

Second, there is a newer mode for IOMMUv2 stuff (using the
mmu_notifier_ops::invalidate_range callback), where the idea is that
you have secondary MMUs that share the normal page tables, and so you
basically send them invalidations at the same time you invalidate the
primary MMU for the process. I think that's the right fit for this
usecase; however, last I looked, this code was extremely broken (see
https://lore.kernel.org/lkml/CAG48ez2NQKVbv=yG_fq_jtZjf8Q=+Wy54FxcFrK_OujFg5BwSQ@mail.gmail.com/
for context). Unless that's changed in the meantime, I think someone
would have to fix that code before it can be relied on for new
usecases.
Anthony Yznaga Oct. 16, 2024, 12:59 a.m. UTC | #16
On 10/14/24 1:07 PM, Jann Horn wrote:
> On Wed, Sep 4, 2024 at 1:22 AM Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>> One major issue to address for this series to function correctly
>> is how to ensure proper TLB flushing when a page in a shared
>> region is unmapped. For example, since the rmaps for pages in a
>> shared region map back to host vmas which point to a host mm, TLB
>> flushes won't be directed to the CPUs the sharing processes have
>> run on. I am by no means an expert in this area. One idea is to
>> install a mmu_notifier on the host mm that can gather the necessary
>> data and do flushes similar to the batch flushing.
> The mmu_notifier API has two ways you can use it:
>
> First, there is the classic mode, where before you start modifying
> PTEs in some range, you remove mirrored PTEs from some other context,
> and until you're done with your PTE modification, you don't allow
> creation of new mirrored PTEs. This is intended for cases where
> individual PTE entries are copied over to some other context (such as
> EPT tables for virtualization). When I last looked at that code, it
> looked fine, and this is what KVM uses. But it probably doesn't match
> your usecase, since you wouldn't want removal of a single page to
> cause the entire page table containing it to be temporarily unmapped
> from the processes that use it?

No, definitely don't want to do that. :-)


>
> Second, there is a newer mode for IOMMUv2 stuff (using the
> mmu_notifier_ops::invalidate_range callback), where the idea is that
> you have secondary MMUs that share the normal page tables, and so you
> basically send them invalidations at the same time you invalidate the
> primary MMU for the process. I think that's the right fit for this
> usecase; however, last I looked, this code was extremely broken (see
> https://lore.kernel.org/lkml/CAG48ez2NQKVbv=yG_fq_jtZjf8Q=+Wy54FxcFrK_OujFg5BwSQ@mail.gmail.com/
> for context). Unless that's changed in the meantime, I think someone
> would have to fix that code before it can be relied on for new
> usecases.

Thank you for this background! Looks like there have since been some 
changes to the mmu notifiers, and the invalidate_range callback became 
arch_invalidate_secondary_tlbs. I'm currently looking into using it to 
flush all TLBs.


Anthony
Jann Horn Oct. 16, 2024, 1:25 p.m. UTC | #17
On Wed, Oct 16, 2024 at 2:59 AM Anthony Yznaga
<anthony.yznaga@oracle.com> wrote:
> On 10/14/24 1:07 PM, Jann Horn wrote:
> > Second, there is a newer mode for IOMMUv2 stuff (using the
> > mmu_notifier_ops::invalidate_range callback), where the idea is that
> > you have secondary MMUs that share the normal page tables, and so you
> > basically send them invalidations at the same time you invalidate the
> > primary MMU for the process. I think that's the right fit for this
> > usecase; however, last I looked, this code was extremely broken (see
> > https://lore.kernel.org/lkml/CAG48ez2NQKVbv=yG_fq_jtZjf8Q=+Wy54FxcFrK_OujFg5BwSQ@mail.gmail.com/
> > for context). Unless that's changed in the meantime, I think someone
> > would have to fix that code before it can be relied on for new
> > usecases.
>
> Thank you for this background! Looks like there have since been some
> changes to the mmu notifiers, and the invalidate_range callback became
> arch_invalidate_secondary_tlbs. I'm currently looking into using it to
> flush all TLBs.

Ah, nice, that looks much better now. With the caveat that, from what
I can tell, the notifiers only work on x86/arm64/powerpc - I guess
maybe that infrastructure should be gated on a HAVE_... arch config
flag...