[V2,0/6] VA to numa node information

Message ID: 1536783844-4145-1-git-send-email-prakash.sangappa@oracle.com

Message

Prakash Sangappa Sept. 12, 2018, 8:23 p.m. UTC
For analysis purposes it is useful to have numa node information
corresponding to the mapped virtual address ranges of a process. Currently,
the file /proc/<pid>/numa_maps provides a list of numa nodes from which pages
are allocated, per VMA of a process. This is not useful if a user needs to
determine which numa node the mapped pages are allocated from for a
particular address range. It would help if the numa node information
presented in /proc/<pid>/numa_maps were broken down by VA ranges, showing the
exact numa node from which the pages have been allocated.

The format of the /proc/<pid>/numa_maps file content depends on the
/proc/<pid>/maps file content, as mentioned in the manpage, i.e. one line
entry for every VMA, corresponding to the entries in the /proc/<pid>/maps file.
Therefore changing the output of /proc/<pid>/numa_maps may not be possible.

This patch set introduces the file /proc/<pid>/numa_vamaps, which
provides a proper breakdown of VA ranges by the numa node id from which the
mapped pages are allocated. For address ranges not having any pages mapped,
a '-' is printed instead of a numa node id.

The file supports lseek, allowing seeking to a specific process virtual
address (VA), starting from which the address range to numa node information
can be read.

The new file /proc/<pid>/numa_vamaps will be governed by ptrace access
mode PTRACE_MODE_READ_REALCREDS.
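
For illustration, here is a minimal usage sketch (error handling trimmed) of
how a tool could seek to a VA of interest and read the range-to-node lines
from this file. This is only a sketch against the proposed interface; the
"<start>-<end> N<node>" output format assumed here is the one shown later in
this thread, not an existing kernel ABI.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char path[64], buf[4096];

	if (argc != 3) {
		fprintf(stderr, "usage: %s <pid> <hex-start-va>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/numa_vamaps", argv[1]);

	int fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Seek to the starting VA of interest; the lseek offset is the VA. */
	off_t va = (off_t)strtoull(argv[2], NULL, 16);
	if (lseek(fd, va, SEEK_SET) == (off_t)-1) {
		perror("lseek");
		return 1;
	}

	/* One read returns "<start>-<end> N<node>" lines from that VA on. */
	ssize_t n = read(fd, buf, sizeof(buf) - 1);
	if (n > 0) {
		buf[n] = '\0';
		fputs(buf, stdout);
	}
	close(fd);
	return 0;
}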

See the following for previous discussion of this proposal:

https://marc.info/?t=152524073400001&r=1&w=2


Prakash Sangappa (6):
  Add check to match numa node id when gathering pte stats
  Add /proc/<pid>/numa_vamaps file for numa node information
  Provide process address range to numa node id mapping
  Add support to lseek /proc/<pid>/numa_vamaps file
  File /proc/<pid>/numa_vamaps access needs PTRACE_MODE_READ_REALCREDS
    check
  /proc/pid/numa_vamaps: document in Documentation/filesystems/proc.txt

 Documentation/filesystems/proc.txt |  21 +++
 fs/proc/base.c                     |   6 +-
 fs/proc/internal.h                 |   1 +
 fs/proc/task_mmu.c                 | 265 ++++++++++++++++++++++++++++++++++++-
 4 files changed, 285 insertions(+), 8 deletions(-)

Comments

Michal Hocko Sept. 13, 2018, 8:40 a.m. UTC | #1
On Wed 12-09-18 13:23:58, Prakash Sangappa wrote:
> For analysis purpose it is useful to have numa node information
> corresponding mapped virtual address ranges of a process. Currently,
> the file /proc/<pid>/numa_maps provides list of numa nodes from where pages
> are allocated per VMA of a process. This is not useful if an user needs to
> determine which numa node the mapped pages are allocated from for a
> particular address range. It would have helped if the numa node information
> presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
> exact numa node from where the pages have been allocated.
> 
> The format of /proc/<pid>/numa_maps file content is dependent on
> /proc/<pid>/maps file content as mentioned in the manpage. i.e one line
> entry for every VMA corresponding to entries in /proc/<pids>/maps file.
> Therefore changing the output of /proc/<pid>/numa_maps may not be possible.
> 
> This patch set introduces the file /proc/<pid>/numa_vamaps which
> will provide proper break down of VA ranges by numa node id from where the
> mapped pages are allocated. For Address ranges not having any pages mapped,
> a '-' is printed instead of numa node id.
> 
> Includes support to lseek, allowing seeking to a specific process Virtual
> address(VA) starting from where the address range to numa node information
> can to be read from this file.
> 
> The new file /proc/<pid>/numa_vamaps will be governed by ptrace access
> mode PTRACE_MODE_READ_REALCREDS.
> 
> See following for previous discussion about this proposal
> 
> https://marc.info/?t=152524073400001&r=1&w=2

It would be really great to give a short summary of the previous
discussion. E.g. why do we need a proc interface in the first place when
we already have an API to query for the information you are proposing to
export [1]

[1] http://lkml.kernel.org/r/20180503085741.GD4535@dhcp22.suse.cz
Prakash Sangappa Sept. 13, 2018, 10:32 p.m. UTC | #2
On 09/13/2018 01:40 AM, Michal Hocko wrote:
> On Wed 12-09-18 13:23:58, Prakash Sangappa wrote:
>> For analysis purpose it is useful to have numa node information
>> corresponding mapped virtual address ranges of a process. Currently,
>> the file /proc/<pid>/numa_maps provides list of numa nodes from where pages
>> are allocated per VMA of a process. This is not useful if an user needs to
>> determine which numa node the mapped pages are allocated from for a
>> particular address range. It would have helped if the numa node information
>> presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
>> exact numa node from where the pages have been allocated.
>>
>> The format of /proc/<pid>/numa_maps file content is dependent on
>> /proc/<pid>/maps file content as mentioned in the manpage. i.e one line
>> entry for every VMA corresponding to entries in /proc/<pids>/maps file.
>> Therefore changing the output of /proc/<pid>/numa_maps may not be possible.
>>
>> This patch set introduces the file /proc/<pid>/numa_vamaps which
>> will provide proper break down of VA ranges by numa node id from where the
>> mapped pages are allocated. For Address ranges not having any pages mapped,
>> a '-' is printed instead of numa node id.
>>
>> Includes support to lseek, allowing seeking to a specific process Virtual
>> address(VA) starting from where the address range to numa node information
>> can to be read from this file.
>>
>> The new file /proc/<pid>/numa_vamaps will be governed by ptrace access
>> mode PTRACE_MODE_READ_REALCREDS.
>>
>> See following for previous discussion about this proposal
>>
>> https://marc.info/?t=152524073400001&r=1&w=2
> It would be really great to give a short summary of the previous
> discussion. E.g. why do we need a proc interface in the first place when
> we already have an API to query for the information you are proposing to
> export [1]
>
> [1] http://lkml.kernel.org/r/20180503085741.GD4535@dhcp22.suse.cz

The proc interface provides a more efficient way to export address range
to numa node id mapping information than using the API.
For example, for sparsely populated mappings, if a VMA has large portions
that do not have any physical pages mapped, the page walk done through the
/proc file interface can skip over non-existent PMDs / PTEs, whereas using
the API the application would have to scan the entire VMA in page size units.

Also, VMAs with THP can contain a mix of 4k pages and huge pages.
The page walk can efficiently determine that a mapping is a THP huge page
and step over it, whereas using the API the application would not know
what page size backs a given VA and so would again have to scan the VMA
in units of the 4k page size.

If this sounds reasonable, I can add it to the commit / patch description.

-Prakash.
Andrew Morton Sept. 14, 2018, 12:10 a.m. UTC | #3
On Thu, 13 Sep 2018 15:32:25 -0700 "prakash.sangappa" <prakash.sangappa@oracle.com> wrote:

> >> https://marc.info/?t=152524073400001&r=1&w=2
> > It would be really great to give a short summary of the previous
> > discussion. E.g. why do we need a proc interface in the first place when
> > we already have an API to query for the information you are proposing to
> > export [1]
> >
> > [1] http://lkml.kernel.org/r/20180503085741.GD4535@dhcp22.suse.cz
> 
> The proc interface provides an efficient way to export address range
> to numa node id mapping information compared to using the API.
> For example, for sparsely populated mappings, if a VMA has large portions
> not have any physical pages mapped, the page walk done thru the /proc file
> interface can skip over non existent PMDs / ptes. Whereas using the
> API the application would have to scan the entire VMA in page size units.
> 
> Also, VMAs having THP pages can have a mix of 4k pages and hugepages.
> The page walks would be efficient in scanning and determining if it is
> a THP huge page and step over it. Whereas using the API, the application
> would not know what page size mapping is used for a given VA and so would
> have to again scan the VMA in units of 4k page size.
> 
> If this sounds reasonable, I can add it to the commit / patch description.

Preferably with some runtime measurements, please.  How much faster is
this interface in real-world situations?  And why does that performance
matter?

It would also be useful to see more details on how this info helps
operators understand/tune/etc their applications and workloads.  In
other words, I'm trying to get an understanding of how useful this code
might be to our users in general.
Dave Hansen Sept. 14, 2018, 12:25 a.m. UTC | #4
On 09/13/2018 05:10 PM, Andrew Morton wrote:
>> Also, VMAs having THP pages can have a mix of 4k pages and hugepages.
>> The page walks would be efficient in scanning and determining if it is
>> a THP huge page and step over it. Whereas using the API, the application
>> would not know what page size mapping is used for a given VA and so would
>> have to again scan the VMA in units of 4k page size.
>>
>> If this sounds reasonable, I can add it to the commit / patch description.

As we are judging whether this is a "good" interface, can you tell us a
bit about its scalability?  For instance, let's say someone has a 1TB
VMA that's populated with interleaved 4k pages.  How much data comes
out?  How long does it take to parse?  Will we effectively deadlock the
system if someone accidentally cat's the wrong /proc file?

/proc seems like a really simple way to implement this, but it seems a
*really* odd choice for something that needs to collect a large amount
of data.  The lseek() stuff is a nice addition, but I wonder if it's
unwieldy to use in practice.  For instance, if you want to read data for
the VMA at 0x1000000 you lseek(fd, 0x1000000, SEEK_SET), right?  You read
~20 bytes of data and then the fd is at 0x1000020.  But, you're getting
data out at the next read() for (at least) the next page, which is also
available at 0x1001000.  Seems funky.  Do other /proc files behave this way?
Michal Hocko Sept. 14, 2018, 5:56 a.m. UTC | #5
On Thu 13-09-18 15:32:25, prakash.sangappa wrote:
> 
> 
> On 09/13/2018 01:40 AM, Michal Hocko wrote:
> > On Wed 12-09-18 13:23:58, Prakash Sangappa wrote:
> > > For analysis purpose it is useful to have numa node information
> > > corresponding mapped virtual address ranges of a process. Currently,
> > > the file /proc/<pid>/numa_maps provides list of numa nodes from where pages
> > > are allocated per VMA of a process. This is not useful if an user needs to
> > > determine which numa node the mapped pages are allocated from for a
> > > particular address range. It would have helped if the numa node information
> > > presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
> > > exact numa node from where the pages have been allocated.
> > > 
> > > The format of /proc/<pid>/numa_maps file content is dependent on
> > > /proc/<pid>/maps file content as mentioned in the manpage. i.e one line
> > > entry for every VMA corresponding to entries in /proc/<pids>/maps file.
> > > Therefore changing the output of /proc/<pid>/numa_maps may not be possible.
> > > 
> > > This patch set introduces the file /proc/<pid>/numa_vamaps which
> > > will provide proper break down of VA ranges by numa node id from where the
> > > mapped pages are allocated. For Address ranges not having any pages mapped,
> > > a '-' is printed instead of numa node id.
> > > 
> > > Includes support to lseek, allowing seeking to a specific process Virtual
> > > address(VA) starting from where the address range to numa node information
> > > can to be read from this file.
> > > 
> > > The new file /proc/<pid>/numa_vamaps will be governed by ptrace access
> > > mode PTRACE_MODE_READ_REALCREDS.
> > > 
> > > See following for previous discussion about this proposal
> > > 
> > > https://marc.info/?t=152524073400001&r=1&w=2
> > It would be really great to give a short summary of the previous
> > discussion. E.g. why do we need a proc interface in the first place when
> > we already have an API to query for the information you are proposing to
> > export [1]
> > 
> > [1] http://lkml.kernel.org/r/20180503085741.GD4535@dhcp22.suse.cz
> 
> The proc interface provides an efficient way to export address range
> to numa node id mapping information compared to using the API.

Do you have any numbers?

> For example, for sparsely populated mappings, if a VMA has large portions
> not have any physical pages mapped, the page walk done thru the /proc file
> interface can skip over non existent PMDs / ptes. Whereas using the
> API the application would have to scan the entire VMA in page size units.

What prevents you from pre-filtering by reading /proc/$pid/maps to get
ranges of interest?
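
For reference, such pre-filtering can be a simple scan of the range column of
/proc/<pid>/maps. A minimal sketch (illustrative only, not part of this
series) could look like:

#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64], line[512];
	unsigned long start, end;

	snprintf(path, sizeof(path), "/proc/%s/maps",
		 argc > 1 ? argv[1] : "self");
	FILE *f = fopen(path, "r");
	if (!f) {
		perror("fopen");
		return 1;
	}

	/* Each maps line starts with "start-end" in hex; keep just the ranges. */
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "%lx-%lx", &start, &end) == 2)
			printf("%lx-%lx (%lu KB)\n", start, end,
			       (end - start) >> 10);

	fclose(f);
	return 0;
}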

> Also, VMAs having THP pages can have a mix of 4k pages and hugepages.
> The page walks would be efficient in scanning and determining if it is
> a THP huge page and step over it. Whereas using the API, the application
> would not know what page size mapping is used for a given VA and so would
> have to again scan the VMA in units of 4k page size.

Why does this matter for something that is for analysis purposes.
Reading the file for the whole address space is far from a free
operation. Is the page walk optimization really essential for usability?
Moreover what prevents move_pages implementation to be clever for the
page walk itself? In other words why would we want to add a new API
rather than make the existing one faster for everybody.
 
> If this sounds reasonable, I can add it to the commit / patch description.

This all is absolutely _essential_ for any new API proposed. Remember that
once we add a new user interface, we have to maintain it for ever. We
used to be too relaxed when adding new proc files in the past and it
backfired many times already.
Steven Sistare Sept. 14, 2018, 4:01 p.m. UTC | #6
On 9/14/2018 1:56 AM, Michal Hocko wrote:
> On Thu 13-09-18 15:32:25, prakash.sangappa wrote:
>> On 09/13/2018 01:40 AM, Michal Hocko wrote:
>>> On Wed 12-09-18 13:23:58, Prakash Sangappa wrote:
>>>> For analysis purpose it is useful to have numa node information
>>>> corresponding mapped virtual address ranges of a process. Currently,
>>>> the file /proc/<pid>/numa_maps provides list of numa nodes from where pages
>>>> are allocated per VMA of a process. This is not useful if an user needs to
>>>> determine which numa node the mapped pages are allocated from for a
>>>> particular address range. It would have helped if the numa node information
>>>> presented in /proc/<pid>/numa_maps was broken down by VA ranges showing the
>>>> exact numa node from where the pages have been allocated.
>>>>
>>>> The format of /proc/<pid>/numa_maps file content is dependent on
>>>> /proc/<pid>/maps file content as mentioned in the manpage. i.e one line
>>>> entry for every VMA corresponding to entries in /proc/<pids>/maps file.
>>>> Therefore changing the output of /proc/<pid>/numa_maps may not be possible.
>>>>
>>>> This patch set introduces the file /proc/<pid>/numa_vamaps which
>>>> will provide proper break down of VA ranges by numa node id from where the
>>>> mapped pages are allocated. For Address ranges not having any pages mapped,
>>>> a '-' is printed instead of numa node id.
>>>>
>>>> Includes support to lseek, allowing seeking to a specific process Virtual
>>>> address(VA) starting from where the address range to numa node information
>>>> can to be read from this file.
>>>>
>>>> The new file /proc/<pid>/numa_vamaps will be governed by ptrace access
>>>> mode PTRACE_MODE_READ_REALCREDS.
>>>>
>>>> See following for previous discussion about this proposal
>>>>
>>>> https://marc.info/?t=152524073400001&r=1&w=2
>>> It would be really great to give a short summary of the previous
>>> discussion. E.g. why do we need a proc interface in the first place when
>>> we already have an API to query for the information you are proposing to
>>> export [1]
>>>
>>> [1] http://lkml.kernel.org/r/20180503085741.GD4535@dhcp22.suse.cz
>>
>> The proc interface provides an efficient way to export address range
>> to numa node id mapping information compared to using the API.
> 
> Do you have any numbers?
> 
>> For example, for sparsely populated mappings, if a VMA has large portions
>> not have any physical pages mapped, the page walk done thru the /proc file
>> interface can skip over non existent PMDs / ptes. Whereas using the
>> API the application would have to scan the entire VMA in page size units.
> 
> What prevents you from pre-filtering by reading /proc/$pid/maps to get
> ranges of interest?

That works for skipping holes, but not for skipping huge pages.  I did a 
quick experiment to time move_pages on a 3 GHz Xeon and a 4.18 kernel.  
Allocate 128 GB and touch every small page.  Call move_pages with nodes=NULL 
to get the node id for all pages, passing 512 consecutive small pages per 
call to move_pages. The total move_pages time is 1.85 secs, and 55 nsec 
per page.  Extrapolating to a 1 TB range, it would take 15 sec to retrieve 
the numa node for every small page in the range.  That is not terrible, but 
it is not interactive, and it becomes terrible for multiple TB.
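
A minimal sketch of such a query loop (an illustrative reconstruction, not
the actual test program; the 512-page batch matches the description above,
while the 1 GB mapping size here is arbitrary). It uses the documented
move_pages() query mode, i.e. nodes == NULL, and builds with -lnuma:

#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SZ   4096UL
#define BATCH     512
#define MAP_BYTES (1UL << 30)	/* 1 GB for illustration; the test above used 128 GB */

int main(void)
{
	char *base = mmap(NULL, MAP_BYTES, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	void *pages[BATCH];
	int status[BATCH];
	unsigned long npages = MAP_BYTES / PAGE_SZ;

	if (base == MAP_FAILED)
		return 1;
	memset(base, 1, MAP_BYTES);		/* touch every small page */

	for (unsigned long i = 0; i < npages; i += BATCH) {
		for (int j = 0; j < BATCH; j++)
			pages[j] = base + (i + j) * PAGE_SZ;
		/* nodes == NULL: only report the node each page resides on */
		if (move_pages(0, BATCH, pages, NULL, status, 0) < 0) {
			perror("move_pages");
			return 1;
		}
		/* status[j] now holds the numa node id (or a -errno value) */
	}
	return 0;
}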

>> Also, VMAs having THP pages can have a mix of 4k pages and hugepages.
>> The page walks would be efficient in scanning and determining if it is
>> a THP huge page and step over it. Whereas using the API, the application
>> would not know what page size mapping is used for a given VA and so would
>> have to again scan the VMA in units of 4k page size.
> 
> Why does this matter for something that is for analysis purposes.
> Reading the file for the whole address space is far from a free
> operation. Is the page walk optimization really essential for usability?
> Moreover what prevents move_pages implementation to be clever for the
> page walk itself? In other words why would we want to add a new API
> rather than make the existing one faster for everybody.

One could optimize move_pages().  If the caller passes a consecutive range
of small pages, and the page walk sees that a VA is mapped by a huge page,
then it can return the same numa node for each of the following VAs that fall
into the huge page range.  It would be faster than 55 nsec per small page, but
hard to say how much faster, and the cost is still driven by the number of
small pages.
 
>> If this sounds reasonable, I can add it to the commit / patch description.
> 
> This all is absolutely _essential_ for any new API proposed. Remember that
> once we add a new user interface, we have to maintain it for ever. We
> used to be too relaxed when adding new proc files in the past and it
> backfired many times already.

An offhand idea -- we could extend /proc/pid/numa_maps in a backward compatible
way by providing a control interface that is poked via write() or ioctl().
Provide one control "do-not-combine".  If do-not-combine has been set, then
the read() function returns a separate line for each range of memory mapped
on the same numa node, in the existing format.

- Steve
Prakash Sangappa Sept. 14, 2018, 6:04 p.m. UTC | #7
On 9/14/18 9:01 AM, Steven Sistare wrote:
> On 9/14/2018 1:56 AM, Michal Hocko wrote:
>> On Thu 13-09-18 15:32:25, prakash.sangappa wrote:
>>>
>>> The proc interface provides an efficient way to export address range
>>> to numa node id mapping information compared to using the API.
>> Do you have any numbers?
>>
>>> For example, for sparsely populated mappings, if a VMA has large portions
>>> not have any physical pages mapped, the page walk done thru the /proc file
>>> interface can skip over non existent PMDs / ptes. Whereas using the
>>> API the application would have to scan the entire VMA in page size units.
>> What prevents you from pre-filtering by reading /proc/$pid/maps to get
>> ranges of interest?
> That works for skipping holes, but not for skipping huge pages.  I did a
> quick experiment to time move_pages on a 3 GHz Xeon and a 4.18 kernel.
> Allocate 128 GB and touch every small page.  Call move_pages with nodes=NULL
> to get the node id for all pages, passing 512 consecutive small pages per
> call to move_nodes. The total move_nodes time is 1.85 secs, and 55 nsec
> per page.  Extrapolating to a 1 TB range, it would take 15 sec to retrieve
> the numa node for every small page in the range.  That is not terrible, but
> it is not interactive, and it becomes terrible for multiple TB.
>

Also, for valid VMAs in the 'maps' file, if the VMA is sparsely populated
with physical pages, the page walk can skip over non-existing page table
entries (PMDs) and so can be faster.

For example, reading the VA range of a 400GB VMA which has a few pages mapped
at the beginning and a few pages at the end, with the rest of the VMA not
having any pages, takes 0.001s using the /proc interface. Whereas with the
move_pages() API, passing 1024 consecutive small page addresses per call, it
takes about 2.4 secs. This is on a similar system running a 4.19 kernel.
Dave Hansen Sept. 14, 2018, 7:01 p.m. UTC | #8
On 09/14/2018 11:04 AM, Prakash Sangappa wrote:
> Also, for valid VMAs in  'maps' file, if the VMA is sparsely
> populated with  physical pages, the page walk can skip over non
> existing page table entires (PMDs) and so can be faster.
Note that this only works for things that were _never_ populated.  They
might be sparse after once being populated and then being reclaimed or
discarded.  Those will still have all the page tables allocated.
Prakash Sangappa Sept. 15, 2018, 1:31 a.m. UTC | #9
On 9/13/2018 5:25 PM, Dave Hansen wrote:
> On 09/13/2018 05:10 PM, Andrew Morton wrote:
>>> Also, VMAs having THP pages can have a mix of 4k pages and hugepages.
>>> The page walks would be efficient in scanning and determining if it is
>>> a THP huge page and step over it. Whereas using the API, the application
>>> would not know what page size mapping is used for a given VA and so would
>>> have to again scan the VMA in units of 4k page size.
>>>
>>> If this sounds reasonable, I can add it to the commit / patch description.
> As we are judging whether this is a "good" interface, can you tell us a
> bit about its scalability?  For instance, let's say someone has a 1TB
> VMA that's populated with interleaved 4k pages.  How much data comes
> out?  How long does it take to parse?  Will we effectively deadlock the
> system if someone accidentally cat's the wrong /proc file?

For the worst case scenario you describe, it would be one line (range) for
each 4k page, which is similar to what you get with '/proc/*/pagemap'. The
amount of data copied out at a time is based on the buffer size used in the
kernel, which is 1024 bytes. If one line (one range) printed is about 40
bytes (chars), that means about 25 lines per copy out. The main concern would
be holding the 'mmap_sem' lock, which can cause hangs. When the 1024-byte
buffer gets filled, mmap_sem is dropped and the buffer content is copied out
to the user buffer. Then the mmap_sem lock is reacquired and the page walk
continues as needed until the specified user buffer size is filled or the end
of the process address space is reached.

One potential issue is that if there is a large VA range with all pages
populated from one numa node, the page walk could take longer while holding
the mmap_sem lock. This can be addressed by dropping and re-acquiring the
mmap_sem lock after a certain number of pages have been walked (say 512,
which is what happens in the '/proc/*/pagemap' case).

>
> /proc seems like a really simple way to implement this, but it seems a
> *really* odd choice for something that needs to collect a large amount
> of data.  The lseek() stuff is a nice addition, but I wonder if it's
> unwieldy to use in practice.  For instance, if you want to read data for
> the VMA at 0x1000000 you lseek(fd, 0x1000000, SEEK_SET, right?  You read
> ~20 bytes of data and then the fd is at 0x1000020.  But, you're getting
> data out at the next read() for (at least) the next page, which is also
> available at 0x1001000.  Seems funky.  Do other /proc files behave this way?
>
Yes, SEEK_SET to the VA. The lseek offset is the process VA, so it is not
going to be different from reading a normal text file, except that /proc
files are special. For example, in the '/proc/*/pagemap' case, read enforces
that the seek/file offset and the user buffer size passed in be a multiple of
the pagemap_entry_t size, or else the read fails.

The usage for the numa_vamaps file will be to SEEK_SET to the VA from where
the VA range to numa node information needs to be read.

The 'fd' offset is not taken into consideration here, just the VA. Say each
VA range to numa node id line printed is about 40 bytes (chars). Now if the
read only reads 20 bytes, it would have read part of the line; a subsequent
read would read the remaining bytes of the line, which are stored in the
kernel buffer.
Michal Hocko Sept. 24, 2018, 5:14 p.m. UTC | #10
On Fri 14-09-18 12:01:18, Steven Sistare wrote:
> On 9/14/2018 1:56 AM, Michal Hocko wrote:
[...]
> > Why does this matter for something that is for analysis purposes.
> > Reading the file for the whole address space is far from a free
> > operation. Is the page walk optimization really essential for usability?
> > Moreover what prevents move_pages implementation to be clever for the
> > page walk itself? In other words why would we want to add a new API
> > rather than make the existing one faster for everybody.
> 
> One could optimize move pages.  If the caller passes a consecutive range
> of small pages, and the page walk sees that a VA is mapped by a huge page, 
> then it can return the same numa node for each of the following VA's that fall 
> into the huge page range. It would be faster than 55 nsec per small page, but 
> hard to say how much faster, and the cost is still driven by the number of 
> small pages. 

This is exactly what I was arguing for. There is some room for
improvements for the existing interface. I yet have to hear the explicit
usecase which would required even better performance that cannot be
achieved by the existing API.
Prakash Sangappa Nov. 10, 2018, 4:48 a.m. UTC | #11
On 9/24/18 10:14 AM, Michal Hocko wrote:
> On Fri 14-09-18 12:01:18, Steven Sistare wrote:
>> On 9/14/2018 1:56 AM, Michal Hocko wrote:
> [...]
>>> Why does this matter for something that is for analysis purposes.
>>> Reading the file for the whole address space is far from a free
>>> operation. Is the page walk optimization really essential for usability?
>>> Moreover what prevents move_pages implementation to be clever for the
>>> page walk itself? In other words why would we want to add a new API
>>> rather than make the existing one faster for everybody.
>> One could optimize move pages.  If the caller passes a consecutive range
>> of small pages, and the page walk sees that a VA is mapped by a huge page,
>> then it can return the same numa node for each of the following VA's that fall
>> into the huge page range. It would be faster than 55 nsec per small page, but
>> hard to say how much faster, and the cost is still driven by the number of
>> small pages.
> This is exactly what I was arguing for. There is some room for
> improvements for the existing interface. I yet have to hear the explicit
> usecase which would required even better performance that cannot be
> achieved by the existing API.
>

The above-mentioned optimization to the move_pages() API helps when scanning
mapped huge pages, but does not help if there are large sparse mappings with
few pages mapped. Otherwise, consider adding page walk support in the
move_pages() implementation and enhancing the API (a new flag?) to return
address range to numa node information. The page walk optimization would
certainly make a difference for usability.

We can have applications (like the Oracle DB) with processes that have large
sparse mappings (in TBs) where only some areas of the mapped address range
are accessed, i.e. large portions have no page tables backing them. This can
become more prevalent on newer systems with multiple TBs of memory.

Here is some data from pmap using the move_pages() API with the optimization.
The following table compares the time pmap takes to print the address mapping
of a large process, with numa node information, using the move_pages() API vs
pmap using the /proc numa_vamaps file.

Running the pmap command on a process with 1.3 TB of address space, with
sparse mappings:

                        ~1.3 TB sparse    250G dense segment with huge pages
move_pages                  8.33s             3.14s
optimized move_pages        6.29s             0.92s
/proc numa_vamaps           0.08s             0.04s

The second column of numbers is the pmap time on a 250G address range of this
process, which maps huge pages (THP & hugetlb).
Steven Sistare Nov. 26, 2018, 7:20 p.m. UTC | #12
On 11/9/2018 11:48 PM, Prakash Sangappa wrote:
> On 9/24/18 10:14 AM, Michal Hocko wrote:
>> On Fri 14-09-18 12:01:18, Steven Sistare wrote:
>>> On 9/14/2018 1:56 AM, Michal Hocko wrote:
>> [...]
>>>> Why does this matter for something that is for analysis purposes.
>>>> Reading the file for the whole address space is far from a free
>>>> operation. Is the page walk optimization really essential for usability?
>>>> Moreover what prevents move_pages implementation to be clever for the
>>>> page walk itself? In other words why would we want to add a new API
>>>> rather than make the existing one faster for everybody.
>>> One could optimize move pages.  If the caller passes a consecutive range
>>> of small pages, and the page walk sees that a VA is mapped by a huge page,
>>> then it can return the same numa node for each of the following VA's that fall
>>> into the huge page range. It would be faster than 55 nsec per small page, but
>>> hard to say how much faster, and the cost is still driven by the number of
>>> small pages.
>> This is exactly what I was arguing for. There is some room for
>> improvements for the existing interface. I yet have to hear the explicit
>> usecase which would required even better performance that cannot be
>> achieved by the existing API.
>>
> 
> Above mentioned optimization to move_pages() API helps when scanning
> mapped huge pages, but does not help if there are large sparse mappings
> with few pages mapped. Otherwise, consider adding page walk support in
> the move_pages() implementation, enhance the API(new flag?) to return
> address range to numa node information. The page walk optimization
> would certainly make a difference for usability.
> 
> We can have applications(Like Oracle DB) having processes with large sparse
> mappings(in TBs)  with only some areas of these mapped address range
> being accessed, basically  large portions not having page tables backing it.
> This can become more prevalent on newer systems with multiple TBs of
> memory.
> 
> Here is some data from pmap using move_pages() API  with optimization.
> Following table compares time pmap takes to print address mapping of a
> large process, with numa node information using move_pages() api vs pmap
> using /proc numa_vamaps file.
> 
> Running pmap command on a process with 1.3 TB of address space, with
> sparse mappings.
> 
>                        ~1.3 TB sparse      250G dense segment with hugepages.
> move_pages              8.33s              3.14
> optimized move_pages    6.29s              0.92
> /proc numa_vamaps       0.08s              0.04
> 
>  
> Second column is pmap time on a 250G address range of this process, which maps
> hugepages(THP & hugetlb).

The data look compelling to me.  numa_vamaps provides a much smoother user experience
for the analyst who is casting a wide net looking for the root of a performance issue.
Almost no waiting to see the data.

- Steve
Prakash Sangappa Dec. 18, 2018, 11:46 p.m. UTC | #13
On 11/26/2018 11:20 AM, Steven Sistare wrote:
> On 11/9/2018 11:48 PM, Prakash Sangappa wrote:
>>
>> Here is some data from pmap using move_pages() API  with optimization.
>> Following table compares time pmap takes to print address mapping of a
>> large process, with numa node information using move_pages() api vs pmap
>> using /proc numa_vamaps file.
>>
>> Running pmap command on a process with 1.3 TB of address space, with
>> sparse mappings.
>>
>>                         ~1.3 TB sparse      250G dense segment with hugepages.
>> move_pages              8.33s              3.14
>> optimized move_pages    6.29s              0.92
>> /proc numa_vamaps       0.08s              0.04
>>
>>   
>> Second column is pmap time on a 250G address range of this process, which maps
>> hugepages(THP & hugetlb).
> The data look compelling to me.  numa_vmap provides a much smoother user experience
> for the analyst who is casting a wide net looking for the root of a performance issue.
> Almost no waiting to see the data.
>
> - Steve

What do others think? How to proceed on this?

Summarizing the discussion so far:

The use case for getting VA (virtual address) to numa node information is
performance analysis. Investigating performance issues involves looking at
where a process's memory is allocated from (which numa node). For the user
analyzing the issue, an efficient way to get this information is useful when
looking at application processes with a large address space.

The patch proposed adding a /proc/<pid>/numa_vamaps file to provide the
VA to numa node id mapping information of a process. This file provides
address range to numa node id info. An address range not having any pages
mapped is indicated with '-' for the numa node id. Sample file content:

00400000-00410000 N1
00410000-0047f000 N0
00480000-00481000 -
00481000-004a0000 N0
..
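
For illustration, here is a minimal sketch of parsing such lines (the format
above is the one proposed in this series, not an existing ABI, and the
parse_line() helper below is hypothetical):

#include <stdio.h>

static int parse_line(const char *line, unsigned long *start,
		      unsigned long *end, int *node)
{
	if (sscanf(line, "%lx-%lx N%d", start, end, node) == 3)
		return 0;	/* range backed by pages on node *node */
	if (sscanf(line, "%lx-%lx -", start, end) == 2) {
		*node = -1;	/* no pages mapped in this range */
		return 0;
	}
	return -1;
}

int main(void)
{
	unsigned long start, end;
	int node;

	if (parse_line("00480000-00481000 -", &start, &end, &node) == 0)
		printf("%lx-%lx node=%d\n", start, end, node);
	return 0;
}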

Dave Hansen asked how this would scale with respect to reading this file
for a large process. The answer is that the file contents are generated
using a page table walk and copied to a user buffer. The mmap_sem lock is
dropped and re-acquired in the process of walking the page table and copying
the file content. The kernel buffer size used determines how long the lock
is held, which can be further improved by dropping the lock and re-acquiring
it after a fixed number (512) of pages are walked.

Also, with support for seeking to a specific VA of the process from where
the VA to numa node information will be provided, the file offset is not
taken into consideration. This behavior is different from reading a normal
file. Other /proc files (e.g. /proc/<pid>/pagemap) also have certain
differences compared to reading a normal file.

Michal Hocko suggested that the currently available 'move_pages' API could
be used to collect the VA to numa node id information. However, use of the
numa_vamaps /proc file will be more efficient than move_pages(). Steven
Sistare suggested optimizing move_pages() for the case when consecutive 4k
page addresses are passed in. I tried out this optimization, and the table
above shows a performance comparison of the move_pages() API vs the
'numa_vamaps' /proc file. Specifically, in the case of sparse mappings the
optimization to move_pages() does not help. The performance benefits seen
with the /proc file will make a difference from a usability point of view.

Andrew Morton had asked about the performance difference between the
move_pages() API and use of the 'numa_vamaps' /proc file, and also about the
use case for getting VA to numa node id information. I hope the description
above answers those questions.
Michal Hocko Dec. 19, 2018, 8:52 p.m. UTC | #14
On Tue 18-12-18 15:46:45, prakash.sangappa wrote:
[...]
> Dave Hansen asked how would it scale, with respect reading this file from
> a large process. Answer is, the file contents are generated using page
> table walk, and copied to user buffer. The mmap_sem lock is drop and
> re-acquired in the process of walking the page table and copying file
> content. The kernel buffer size used determines how long the lock is held.
> Which can be further improved to drop the lock and re-acquire after a
> fixed number(512) of pages are walked.

I guess you are still missing the point here. Have you tried a larger
mapping with interleaved memory policy? I would bet my hat that you are
going to spend a large part of the time just pushing the output to the
userspace... Not to mention the parsing on the consumer side.

Also you keep failing (IMO) to explain _who_ is going to be the consumer
of the file. What kind of analysis will need such optimized data
collection, and what can you do about that?

This is really _essential_ when adding a new interface to provide data
that is already available by other means. In other words, tell us your
specific usecase that is hitting a bottleneck that cannot be handled by
the existing API, and we can start considering a new one.