Message ID | 1536783844-4145-1-git-send-email-prakash.sangappa@oracle.com (mailing list archive)
---|---
Series | VA to numa node information
On Wed 12-09-18 13:23:58, Prakash Sangappa wrote:
> For analysis purposes it is useful to have numa node information corresponding to the mapped virtual address ranges of a process. Currently, the file /proc/<pid>/numa_maps provides a list of numa nodes from which pages are allocated per VMA of a process. This is not useful if a user needs to determine which numa node the mapped pages are allocated from for a particular address range. It would have helped if the numa node information presented in /proc/<pid>/numa_maps was broken down by VA ranges, showing the exact numa node from which the pages have been allocated.
>
> The format of the /proc/<pid>/numa_maps file content is dependent on the /proc/<pid>/maps file content, as mentioned in the manpage, i.e. one line entry for every VMA corresponding to the entries in the /proc/<pid>/maps file. Therefore changing the output of /proc/<pid>/numa_maps may not be possible.
>
> This patch set introduces the file /proc/<pid>/numa_vamaps, which will provide a proper breakdown of VA ranges by the numa node id from which the mapped pages are allocated. For address ranges not having any pages mapped, a '-' is printed instead of a numa node id.
>
> It includes support for lseek, allowing seeking to a specific process virtual address (VA) from which the address range to numa node information can be read from this file.
>
> The new file /proc/<pid>/numa_vamaps will be governed by ptrace access mode PTRACE_MODE_READ_REALCREDS.
>
> See the following for previous discussion of this proposal:
>
> https://marc.info/?t=152524073400001&r=1&w=2

It would be really great to give a short summary of the previous discussion. E.g. why do we need a proc interface in the first place when we already have an API to query for the information you are proposing to export [1]?

[1] http://lkml.kernel.org/r/20180503085741.GD4535@dhcp22.suse.cz
On 09/13/2018 01:40 AM, Michal Hocko wrote:
> On Wed 12-09-18 13:23:58, Prakash Sangappa wrote:
>> For analysis purposes it is useful to have numa node information corresponding to the mapped virtual address ranges of a process. Currently, the file /proc/<pid>/numa_maps provides a list of numa nodes from which pages are allocated per VMA of a process. This is not useful if a user needs to determine which numa node the mapped pages are allocated from for a particular address range. It would have helped if the numa node information presented in /proc/<pid>/numa_maps was broken down by VA ranges, showing the exact numa node from which the pages have been allocated.
>>
>> The format of the /proc/<pid>/numa_maps file content is dependent on the /proc/<pid>/maps file content, as mentioned in the manpage, i.e. one line entry for every VMA corresponding to the entries in the /proc/<pid>/maps file. Therefore changing the output of /proc/<pid>/numa_maps may not be possible.
>>
>> This patch set introduces the file /proc/<pid>/numa_vamaps, which will provide a proper breakdown of VA ranges by the numa node id from which the mapped pages are allocated. For address ranges not having any pages mapped, a '-' is printed instead of a numa node id.
>>
>> It includes support for lseek, allowing seeking to a specific process virtual address (VA) from which the address range to numa node information can be read from this file.
>>
>> The new file /proc/<pid>/numa_vamaps will be governed by ptrace access mode PTRACE_MODE_READ_REALCREDS.
>>
>> See the following for previous discussion of this proposal:
>>
>> https://marc.info/?t=152524073400001&r=1&w=2
>
> It would be really great to give a short summary of the previous discussion. E.g. why do we need a proc interface in the first place when we already have an API to query for the information you are proposing to export [1]?
>
> [1] http://lkml.kernel.org/r/20180503085741.GD4535@dhcp22.suse.cz

The proc interface provides an efficient way to export address range to numa node id mapping information compared to using the API. For example, for sparsely populated mappings, if a VMA has large portions not having any physical pages mapped, the page walk done through the /proc file interface can skip over non-existent PMDs/PTEs. Whereas using the API, the application would have to scan the entire VMA in page-size units.

Also, VMAs having THP pages can have a mix of 4k pages and hugepages. The page walk can efficiently determine that a mapping is a THP huge page and step over it. Whereas using the API, the application would not know what page size mapping is used for a given VA and so would again have to scan the VMA in units of the 4k page size.

If this sounds reasonable, I can add it to the commit / patch description.

-Prakash.
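[Editor's note: for readers unfamiliar with the API being compared against, the following is a minimal sketch (not part of the patch; build with -lnuma) of a per-page query using the existing move_pages(2) interface. Passing nodes == NULL makes move_pages() report, in status[], the node each page currently resides on without migrating anything; this is the mode that forces an application to walk a range in 4k steps.]

#include <numaif.h>
#include <stdio.h>
#include <unistd.h>

#define BATCH 512

/* Query (without migrating) the numa node of every small page in [start, start+len). */
static void print_node_per_page(char *start, size_t len)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	void *pages[BATCH];
	int status[BATCH];
	size_t off = 0;

	while (off < len) {
		unsigned long n = 0;

		while (n < BATCH && off < len) {
			pages[n++] = start + off;
			off += pagesz;
		}
		/* nodes == NULL: report current node in status[], do not move pages */
		if (move_pages(0 /* self */, n, pages, NULL, status, 0) < 0) {
			perror("move_pages");
			return;
		}
		for (unsigned long i = 0; i < n; i++) {
			if (status[i] >= 0)
				printf("%p N%d\n", pages[i], status[i]);
			else
				printf("%p -\n", pages[i]); /* e.g. -ENOENT: no page mapped */
		}
	}
}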
On Thu, 13 Sep 2018 15:32:25 -0700 "prakash.sangappa" <prakash.sangappa@oracle.com> wrote:
>>> https://marc.info/?t=152524073400001&r=1&w=2
>> It would be really great to give a short summary of the previous discussion. E.g. why do we need a proc interface in the first place when we already have an API to query for the information you are proposing to export [1]?
>>
>> [1] http://lkml.kernel.org/r/20180503085741.GD4535@dhcp22.suse.cz
>
> The proc interface provides an efficient way to export address range to numa node id mapping information compared to using the API. For example, for sparsely populated mappings, if a VMA has large portions not having any physical pages mapped, the page walk done through the /proc file interface can skip over non-existent PMDs/PTEs. Whereas using the API, the application would have to scan the entire VMA in page-size units.
>
> Also, VMAs having THP pages can have a mix of 4k pages and hugepages. The page walk can efficiently determine that a mapping is a THP huge page and step over it. Whereas using the API, the application would not know what page size mapping is used for a given VA and so would again have to scan the VMA in units of the 4k page size.
>
> If this sounds reasonable, I can add it to the commit / patch description.

Preferably with some runtime measurements, please. How much faster is this interface in real-world situations? And why does that performance matter?

It would also be useful to see more details on how this info helps operators understand/tune/etc their applications and workloads. In other words, I'm trying to get an understanding of how useful this code might be to our users in general.
On 09/13/2018 05:10 PM, Andrew Morton wrote:
>> Also, VMAs having THP pages can have a mix of 4k pages and hugepages. The page walk can efficiently determine that a mapping is a THP huge page and step over it. Whereas using the API, the application would not know what page size mapping is used for a given VA and so would again have to scan the VMA in units of the 4k page size.
>>
>> If this sounds reasonable, I can add it to the commit / patch description.

As we are judging whether this is a "good" interface, can you tell us a bit about its scalability? For instance, let's say someone has a 1TB VMA that's populated with interleaved 4k pages. How much data comes out? How long does it take to parse? Will we effectively deadlock the system if someone accidentally cat's the wrong /proc file?

/proc seems like a really simple way to implement this, but it seems a *really* odd choice for something that needs to collect a large amount of data. The lseek() stuff is a nice addition, but I wonder if it's unwieldy to use in practice. For instance, if you want to read data for the VMA at 0x1000000, you lseek(fd, 0x1000000, SEEK_SET), right? You read ~20 bytes of data and then the fd is at 0x1000020. But, you're getting data out at the next read() for (at least) the next page, which is also available at 0x1001000. Seems funky. Do other /proc files behave this way?
On Thu 13-09-18 15:32:25, prakash.sangappa wrote:
> On 09/13/2018 01:40 AM, Michal Hocko wrote:
>> On Wed 12-09-18 13:23:58, Prakash Sangappa wrote:
>>> For analysis purposes it is useful to have numa node information corresponding to the mapped virtual address ranges of a process. Currently, the file /proc/<pid>/numa_maps provides a list of numa nodes from which pages are allocated per VMA of a process. This is not useful if a user needs to determine which numa node the mapped pages are allocated from for a particular address range. It would have helped if the numa node information presented in /proc/<pid>/numa_maps was broken down by VA ranges, showing the exact numa node from which the pages have been allocated.
>>>
>>> The format of the /proc/<pid>/numa_maps file content is dependent on the /proc/<pid>/maps file content, as mentioned in the manpage, i.e. one line entry for every VMA corresponding to the entries in the /proc/<pid>/maps file. Therefore changing the output of /proc/<pid>/numa_maps may not be possible.
>>>
>>> This patch set introduces the file /proc/<pid>/numa_vamaps, which will provide a proper breakdown of VA ranges by the numa node id from which the mapped pages are allocated. For address ranges not having any pages mapped, a '-' is printed instead of a numa node id.
>>>
>>> It includes support for lseek, allowing seeking to a specific process virtual address (VA) from which the address range to numa node information can be read from this file.
>>>
>>> The new file /proc/<pid>/numa_vamaps will be governed by ptrace access mode PTRACE_MODE_READ_REALCREDS.
>>>
>>> See the following for previous discussion of this proposal:
>>>
>>> https://marc.info/?t=152524073400001&r=1&w=2
>>
>> It would be really great to give a short summary of the previous discussion. E.g. why do we need a proc interface in the first place when we already have an API to query for the information you are proposing to export [1]?
>>
>> [1] http://lkml.kernel.org/r/20180503085741.GD4535@dhcp22.suse.cz
>
> The proc interface provides an efficient way to export address range to numa node id mapping information compared to using the API.

Do you have any numbers?

> For example, for sparsely populated mappings, if a VMA has large portions not having any physical pages mapped, the page walk done through the /proc file interface can skip over non-existent PMDs/PTEs. Whereas using the API, the application would have to scan the entire VMA in page-size units.

What prevents you from pre-filtering by reading /proc/$pid/maps to get ranges of interest?

> Also, VMAs having THP pages can have a mix of 4k pages and hugepages. The page walk can efficiently determine that a mapping is a THP huge page and step over it. Whereas using the API, the application would not know what page size mapping is used for a given VA and so would again have to scan the VMA in units of the 4k page size.

Why does this matter for something that is for analysis purposes? Reading the file for the whole address space is far from a free operation. Is the page walk optimization really essential for usability? Moreover, what prevents the move_pages implementation from being clever about the page walk itself? In other words, why would we want to add a new API rather than make the existing one faster for everybody?

> If this sounds reasonable, I can add it to the commit / patch description.

This all is absolutely _essential_ for any new API proposed. Remember that once we add a new user interface, we have to maintain it forever. We used to be too relaxed when adding new proc files in the past and it backfired many times already.
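[Editor's note: a minimal sketch of the pre-filtering Michal suggests, i.e. parsing /proc/<pid>/maps for VMA boundaries and only querying the mapped ranges. query_range() is a hypothetical placeholder (for example, the move_pages() batching sketched earlier); as the thread goes on to discuss, this skips unmapped VMAs but not holes or huge pages within a VMA.]

#include <inttypes.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper, e.g. the move_pages() batching shown earlier. */
void query_range(pid_t pid, uint64_t start, uint64_t end);

/* Walk /proc/<pid>/maps and query only the ranges that are actually mapped. */
static int query_mapped_vmas(pid_t pid)
{
	char path[64], line[512];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/maps", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		uint64_t start, end;

		/* each maps line begins with "start-end perms offset dev inode path" */
		if (sscanf(line, "%" SCNx64 "-%" SCNx64, &start, &end) == 2)
			query_range(pid, start, end);
	}
	fclose(f);
	return 0;
}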
On 9/14/2018 1:56 AM, Michal Hocko wrote:
> On Thu 13-09-18 15:32:25, prakash.sangappa wrote:
>> On 09/13/2018 01:40 AM, Michal Hocko wrote:
>>> On Wed 12-09-18 13:23:58, Prakash Sangappa wrote:
>>>> For analysis purposes it is useful to have numa node information corresponding to the mapped virtual address ranges of a process. Currently, the file /proc/<pid>/numa_maps provides a list of numa nodes from which pages are allocated per VMA of a process. This is not useful if a user needs to determine which numa node the mapped pages are allocated from for a particular address range. It would have helped if the numa node information presented in /proc/<pid>/numa_maps was broken down by VA ranges, showing the exact numa node from which the pages have been allocated.
>>>>
>>>> The format of the /proc/<pid>/numa_maps file content is dependent on the /proc/<pid>/maps file content, as mentioned in the manpage, i.e. one line entry for every VMA corresponding to the entries in the /proc/<pid>/maps file. Therefore changing the output of /proc/<pid>/numa_maps may not be possible.
>>>>
>>>> This patch set introduces the file /proc/<pid>/numa_vamaps, which will provide a proper breakdown of VA ranges by the numa node id from which the mapped pages are allocated. For address ranges not having any pages mapped, a '-' is printed instead of a numa node id.
>>>>
>>>> It includes support for lseek, allowing seeking to a specific process virtual address (VA) from which the address range to numa node information can be read from this file.
>>>>
>>>> The new file /proc/<pid>/numa_vamaps will be governed by ptrace access mode PTRACE_MODE_READ_REALCREDS.
>>>>
>>>> See the following for previous discussion of this proposal:
>>>>
>>>> https://marc.info/?t=152524073400001&r=1&w=2
>>>
>>> It would be really great to give a short summary of the previous discussion. E.g. why do we need a proc interface in the first place when we already have an API to query for the information you are proposing to export [1]?
>>>
>>> [1] http://lkml.kernel.org/r/20180503085741.GD4535@dhcp22.suse.cz
>>
>> The proc interface provides an efficient way to export address range to numa node id mapping information compared to using the API.
>
> Do you have any numbers?
>
>> For example, for sparsely populated mappings, if a VMA has large portions not having any physical pages mapped, the page walk done through the /proc file interface can skip over non-existent PMDs/PTEs. Whereas using the API, the application would have to scan the entire VMA in page-size units.
>
> What prevents you from pre-filtering by reading /proc/$pid/maps to get ranges of interest?

That works for skipping holes, but not for skipping huge pages. I did a quick experiment to time move_pages on a 3 GHz Xeon and a 4.18 kernel. Allocate 128 GB and touch every small page. Call move_pages with nodes=NULL to get the node id for all pages, passing 512 consecutive small pages per call to move_pages. The total move_pages time is 1.85 secs, and 55 nsec per page. Extrapolating to a 1 TB range, it would take 15 sec to retrieve the numa node for every small page in the range. That is not terrible, but it is not interactive, and it becomes terrible for multiple TB.

>> Also, VMAs having THP pages can have a mix of 4k pages and hugepages. The page walk can efficiently determine that a mapping is a THP huge page and step over it. Whereas using the API, the application would not know what page size mapping is used for a given VA and so would again have to scan the VMA in units of the 4k page size.
>
> Why does this matter for something that is for analysis purposes? Reading the file for the whole address space is far from a free operation. Is the page walk optimization really essential for usability? Moreover, what prevents the move_pages implementation from being clever about the page walk itself? In other words, why would we want to add a new API rather than make the existing one faster for everybody?

One could optimize move_pages. If the caller passes a consecutive range of small pages, and the page walk sees that a VA is mapped by a huge page, then it can return the same numa node for each of the following VAs that fall into the huge page range. It would be faster than 55 nsec per small page, but hard to say how much faster, and the cost is still driven by the number of small pages.

>> If this sounds reasonable, I can add it to the commit / patch description.
>
> This all is absolutely _essential_ for any new API proposed. Remember that once we add a new user interface, we have to maintain it forever. We used to be too relaxed when adding new proc files in the past and it backfired many times already.

An offhand idea -- we could extend /proc/pid/numa_maps in a backward-compatible way by providing a control interface that is poked via write() or ioctl(). Provide one control, "do-not-combine". If do-not-combine has been set, then the read() function returns a separate line for each range of memory mapped on the same numa node, in the existing format.

- Steve
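[Editor's note: a rough reconstruction, not Steve's actual test program, of the measurement described above: touch every small page of a large anonymous mapping, then time move_pages() with nodes == NULL in batches of 512 pages. The 1 GB size is a stand-in for the 128 GB used in the quoted numbers; build with -lnuma.]

#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define BATCH 512

int main(void)
{
	size_t size = 1UL << 30;	/* 1 GB here; 128 GB in the quoted test */
	long pagesz = sysconf(_SC_PAGESIZE);
	size_t npages = size / pagesz;
	char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	void *pages[BATCH];
	int status[BATCH];
	struct timespec t0, t1;

	if (p == MAP_FAILED)
		return 1;
	memset(p, 1, size);		/* touch every small page */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (size_t i = 0; i < npages; i += BATCH) {
		size_t n = (npages - i < BATCH) ? npages - i : BATCH;

		for (size_t j = 0; j < n; j++)
			pages[j] = p + (i + j) * pagesz;
		/* nodes == NULL: query current node of each page, no migration */
		if (move_pages(0, n, pages, NULL, status, 0) < 0) {
			perror("move_pages");
			return 1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.1f nsec/page\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / npages);
	return 0;
}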
On 9/14/18 9:01 AM, Steven Sistare wrote:
> On 9/14/2018 1:56 AM, Michal Hocko wrote:
>> On Thu 13-09-18 15:32:25, prakash.sangappa wrote:
>>>
>>> The proc interface provides an efficient way to export address range to numa node id mapping information compared to using the API.
>>
>> Do you have any numbers?
>>
>>> For example, for sparsely populated mappings, if a VMA has large portions not having any physical pages mapped, the page walk done through the /proc file interface can skip over non-existent PMDs/PTEs. Whereas using the API, the application would have to scan the entire VMA in page-size units.
>>
>> What prevents you from pre-filtering by reading /proc/$pid/maps to get ranges of interest?
>
> That works for skipping holes, but not for skipping huge pages. I did a quick experiment to time move_pages on a 3 GHz Xeon and a 4.18 kernel. Allocate 128 GB and touch every small page. Call move_pages with nodes=NULL to get the node id for all pages, passing 512 consecutive small pages per call to move_pages. The total move_pages time is 1.85 secs, and 55 nsec per page. Extrapolating to a 1 TB range, it would take 15 sec to retrieve the numa node for every small page in the range. That is not terrible, but it is not interactive, and it becomes terrible for multiple TB.

Also, for valid VMAs in the 'maps' file, if the VMA is sparsely populated with physical pages, the page walk can skip over non-existent page table entries (PMDs) and so can be faster. For example, reading the VA range of a 400GB VMA which has a few pages mapped at the beginning, a few pages at the end, and no pages in the rest of the VMA takes 0.001s using the /proc interface. Whereas with the move_pages() API, passing 1024 consecutive small page addresses per call, it takes about 2.4 secs. This is on a similar system running a 4.19 kernel.
On 09/14/2018 11:04 AM, Prakash Sangappa wrote:
> Also, for valid VMAs in the 'maps' file, if the VMA is sparsely populated with physical pages, the page walk can skip over non-existent page table entries (PMDs) and so can be faster.

Note that this only works for things that were _never_ populated. They might be sparse after once being populated and then being reclaimed or discarded. Those will still have all the page tables allocated.
On 9/13/2018 5:25 PM, Dave Hansen wrote:
> On 09/13/2018 05:10 PM, Andrew Morton wrote:
>>> Also, VMAs having THP pages can have a mix of 4k pages and hugepages. The page walk can efficiently determine that a mapping is a THP huge page and step over it. Whereas using the API, the application would not know what page size mapping is used for a given VA and so would again have to scan the VMA in units of the 4k page size.
>>>
>>> If this sounds reasonable, I can add it to the commit / patch description.
>
> As we are judging whether this is a "good" interface, can you tell us a bit about its scalability? For instance, let's say someone has a 1TB VMA that's populated with interleaved 4k pages. How much data comes out? How long does it take to parse? Will we effectively deadlock the system if someone accidentally cat's the wrong /proc file?

For the worst-case scenario you describe, it would be one line (range) for each 4k page, which would be similar to what you get with /proc/*/pagemap. The amount of data copied out at a time is based on the buffer size used in the kernel, which is 1024 bytes. That is, if one line (one range) printed is about 40 bytes (chars), that means about 25 lines per copy-out.

The main concern would be holding the mmap_sem lock, which can cause hangs. When the 1024-byte buffer gets filled, the mmap_sem is dropped and the buffer content is copied out to the user buffer. Then the mmap_sem lock is re-acquired and the page walk continues as needed, until the specified user buffer size is filled or the end of the process address space is reached.

One potential issue could be a large VA range with all pages populated from one numa node; then the page walk could take longer while holding the mmap_sem lock. This can be addressed by dropping and re-acquiring the mmap_sem lock after a certain number of pages have been walked (say 512, which is what happens in the /proc/*/pagemap case).

> /proc seems like a really simple way to implement this, but it seems a *really* odd choice for something that needs to collect a large amount of data. The lseek() stuff is a nice addition, but I wonder if it's unwieldy to use in practice. For instance, if you want to read data for the VMA at 0x1000000, you lseek(fd, 0x1000000, SEEK_SET), right? You read ~20 bytes of data and then the fd is at 0x1000020. But, you're getting data out at the next read() for (at least) the next page, which is also available at 0x1001000. Seems funky. Do other /proc files behave this way?

Yes, SEEK_SET to the VA. The lseek offset is the process VA, so it is not going to be that different from reading a normal text file, except that /proc files are special. For example, in the /proc/*/pagemap case, read enforces that the seek/file offset and the user buffer size passed in are multiples of the pagemap_entry_t size, or else the read fails.

The usage for the numa_vamaps file will be to SEEK_SET to the VA from which the VA range to numa node information needs to be read. The 'fd' offset is not taken into consideration here, just the VA. Say each VA range to numa node id line printed is about 40 bytes (chars). If a read only reads 20 bytes, it would have read part of the line; a subsequent read would read the remaining bytes of the line, which are kept in the kernel buffer.
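[Editor's note: to make the intended calling convention concrete, here is a usage sketch for the proposed file as described in this thread. The file was never merged into mainline, so this is illustrative only: lseek() to the VA of interest, then read() text records from there.]

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Dump "start-end Nx" records starting at virtual address 'va' of process 'pid'. */
static void dump_numa_vamaps_from(pid_t pid, off_t va)
{
	char path[64], buf[4096];
	ssize_t n;
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/numa_vamaps", (int)pid);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return;

	/* Per the proposal, the seek offset is interpreted as a process VA. */
	if (lseek(fd, va, SEEK_SET) != (off_t)-1)
		while ((n = read(fd, buf, sizeof(buf))) > 0)
			fwrite(buf, 1, n, stdout);

	close(fd);
}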
On Fri 14-09-18 12:01:18, Steven Sistare wrote:
> On 9/14/2018 1:56 AM, Michal Hocko wrote:
[...]
>> Why does this matter for something that is for analysis purposes? Reading the file for the whole address space is far from a free operation. Is the page walk optimization really essential for usability? Moreover, what prevents the move_pages implementation from being clever about the page walk itself? In other words, why would we want to add a new API rather than make the existing one faster for everybody?
>
> One could optimize move_pages. If the caller passes a consecutive range of small pages, and the page walk sees that a VA is mapped by a huge page, then it can return the same numa node for each of the following VAs that fall into the huge page range. It would be faster than 55 nsec per small page, but hard to say how much faster, and the cost is still driven by the number of small pages.

This is exactly what I was arguing for. There is some room for improvement in the existing interface. I have yet to hear the explicit use case that would require even better performance than can be achieved by the existing API.
On 9/24/18 10:14 AM, Michal Hocko wrote:
> On Fri 14-09-18 12:01:18, Steven Sistare wrote:
>> On 9/14/2018 1:56 AM, Michal Hocko wrote:
> [...]
>>> Why does this matter for something that is for analysis purposes? Reading the file for the whole address space is far from a free operation. Is the page walk optimization really essential for usability? Moreover, what prevents the move_pages implementation from being clever about the page walk itself? In other words, why would we want to add a new API rather than make the existing one faster for everybody?
>>
>> One could optimize move_pages. If the caller passes a consecutive range of small pages, and the page walk sees that a VA is mapped by a huge page, then it can return the same numa node for each of the following VAs that fall into the huge page range. It would be faster than 55 nsec per small page, but hard to say how much faster, and the cost is still driven by the number of small pages.
>
> This is exactly what I was arguing for. There is some room for improvement in the existing interface. I have yet to hear the explicit use case that would require even better performance than can be achieved by the existing API.

The above mentioned optimization to the move_pages() API helps when scanning mapped huge pages, but does not help if there are large sparse mappings with few pages mapped. Otherwise, consider adding page walk support in the move_pages() implementation and enhancing the API (a new flag?) to return address range to numa node information. The page walk optimization would certainly make a difference for usability.

We can have applications (like Oracle DB) having processes with large sparse mappings (in TBs) with only some areas of these mapped address ranges being accessed, i.e. large portions not having page tables backing them. This can become more prevalent on newer systems with multiple TBs of memory.

Here is some data from pmap using the move_pages() API with the optimization. The following table compares the time pmap takes to print the address mapping of a large process with numa node information, using the move_pages() API vs. pmap using the /proc numa_vamaps file.

Running the pmap command on a process with 1.3 TB of address space, with sparse mappings:

                         ~1.3 TB sparse   250G dense segment with hugepages
   move_pages            8.33s            3.14s
   optimized move_pages  6.29s            0.92s
   /proc numa_vamaps     0.08s            0.04s

The second data column is the pmap time on a 250G address range of this process, which maps hugepages (THP & hugetlb).
On 11/9/2018 11:48 PM, Prakash Sangappa wrote:
> On 9/24/18 10:14 AM, Michal Hocko wrote:
>> On Fri 14-09-18 12:01:18, Steven Sistare wrote:
>>> On 9/14/2018 1:56 AM, Michal Hocko wrote:
>> [...]
>>>> Why does this matter for something that is for analysis purposes? Reading the file for the whole address space is far from a free operation. Is the page walk optimization really essential for usability? Moreover, what prevents the move_pages implementation from being clever about the page walk itself? In other words, why would we want to add a new API rather than make the existing one faster for everybody?
>>>
>>> One could optimize move_pages. If the caller passes a consecutive range of small pages, and the page walk sees that a VA is mapped by a huge page, then it can return the same numa node for each of the following VAs that fall into the huge page range. It would be faster than 55 nsec per small page, but hard to say how much faster, and the cost is still driven by the number of small pages.
>>
>> This is exactly what I was arguing for. There is some room for improvement in the existing interface. I have yet to hear the explicit use case that would require even better performance than can be achieved by the existing API.
>
> The above mentioned optimization to the move_pages() API helps when scanning mapped huge pages, but does not help if there are large sparse mappings with few pages mapped. Otherwise, consider adding page walk support in the move_pages() implementation and enhancing the API (a new flag?) to return address range to numa node information. The page walk optimization would certainly make a difference for usability.
>
> We can have applications (like Oracle DB) having processes with large sparse mappings (in TBs) with only some areas of these mapped address ranges being accessed, i.e. large portions not having page tables backing them. This can become more prevalent on newer systems with multiple TBs of memory.
>
> Here is some data from pmap using the move_pages() API with the optimization. The following table compares the time pmap takes to print the address mapping of a large process with numa node information, using the move_pages() API vs. pmap using the /proc numa_vamaps file.
>
> Running the pmap command on a process with 1.3 TB of address space, with sparse mappings:
>
>                          ~1.3 TB sparse   250G dense segment with hugepages
>    move_pages            8.33s            3.14s
>    optimized move_pages  6.29s            0.92s
>    /proc numa_vamaps     0.08s            0.04s
>
> The second data column is the pmap time on a 250G address range of this process, which maps hugepages (THP & hugetlb).

The data look compelling to me. numa_vamaps provides a much smoother user experience for the analyst who is casting a wide net looking for the root of a performance issue. Almost no waiting to see the data.

- Steve
On 11/26/2018 11:20 AM, Steven Sistare wrote:
> On 11/9/2018 11:48 PM, Prakash Sangappa wrote:
>>
>> Here is some data from pmap using the move_pages() API with the optimization. The following table compares the time pmap takes to print the address mapping of a large process with numa node information, using the move_pages() API vs. pmap using the /proc numa_vamaps file.
>>
>> Running the pmap command on a process with 1.3 TB of address space, with sparse mappings:
>>
>>                          ~1.3 TB sparse   250G dense segment with hugepages
>>    move_pages            8.33s            3.14s
>>    optimized move_pages  6.29s            0.92s
>>    /proc numa_vamaps     0.08s            0.04s
>>
>> The second data column is the pmap time on a 250G address range of this process, which maps hugepages (THP & hugetlb).
>
> The data look compelling to me. numa_vamaps provides a much smoother user experience for the analyst who is casting a wide net looking for the root of a performance issue. Almost no waiting to see the data.
>
> - Steve

What do others think? How to proceed on this?

Summarizing the discussion so far:

The use case for getting VA (virtual address) to numa node information is performance analysis. Investigating performance issues would involve looking at where a process's memory is allocated from (which numa node). For the user analyzing the issue, an efficient way to get this information is useful when looking at application processes having a large address space.

The patch proposes adding a /proc/<pid>/numa_vamaps file providing the VA to numa node id mapping information of a process. This file provides address range to numa node id info. An address range not having any pages mapped is indicated with a '-' instead of a numa node id. Sample file content:

00400000-00410000 N1
00410000-0047f000 N0
00480000-00481000 -
00481000-004a0000 N0
..

Dave Hansen asked how it would scale with respect to reading this file for a large process. The answer is that the file contents are generated using a page table walk and copied to a user buffer. The mmap_sem lock is dropped and re-acquired in the process of walking the page table and copying the file content. The kernel buffer size used determines how long the lock is held, which can be further improved by dropping the lock and re-acquiring it after a fixed number (512) of pages have been walked.

Also, with support for seeking to a specific VA of the process from which the VA to numa node information will be provided, the file offset is not taken into consideration. This behavior is different from reading a normal file. Other /proc files (e.g. /proc/<pid>/pagemap) also have certain differences compared to reading a normal file.

Michal Hocko suggested that the currently available move_pages() API could be used to collect the VA to numa node id information. However, use of the numa_vamaps /proc file will be more efficient than move_pages(). Steven Sistare suggested optimizing move_pages() for the case when consecutive 4k page addresses are passed in. I tried out this optimization, and the table quoted above shows the performance comparison of the move_pages() API vs. the numa_vamaps /proc file. Specifically, in the case of sparse mappings the optimization to move_pages() does not help. The performance benefit seen with the /proc file will make a difference from a usability point of view.

Andrew Morton had asked about the performance difference between the move_pages() API and use of the numa_vamaps /proc file, and also about the use case for getting VA to numa node id information. I hope the above description answers those questions.
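[Editor's note: a consumer of the sample format quoted in the summary above could be as simple as the following sketch. The "start-end Nx" / "start-end -" line format is taken from the example in this thread and is not a settled ABI; the field widths are assumptions.]

#include <inttypes.h>
#include <stdio.h>

/* Parse records piped in, e.g. from: cat /proc/<pid>/numa_vamaps */
int main(void)
{
	uint64_t start, end;
	char node[16];

	while (scanf("%" SCNx64 "-%" SCNx64 " %15s", &start, &end, node) == 3) {
		if (node[0] == 'N')
			printf("%#" PRIx64 "-%#" PRIx64 ": node %s\n", start, end, node + 1);
		else
			printf("%#" PRIx64 "-%#" PRIx64 ": no pages mapped\n", start, end);
	}
	return 0;
}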
On Tue 18-12-18 15:46:45, prakash.sangappa wrote:
[...]
> Dave Hansen asked how it would scale with respect to reading this file for a large process. The answer is that the file contents are generated using a page table walk and copied to a user buffer. The mmap_sem lock is dropped and re-acquired in the process of walking the page table and copying the file content. The kernel buffer size used determines how long the lock is held, which can be further improved by dropping the lock and re-acquiring it after a fixed number (512) of pages have been walked.

I guess you are still missing the point here. Have you tried a larger mapping with an interleaved memory policy? I would bet my hat that you are going to spend a large part of the time just pushing the output to userspace... Not to mention the parsing on the consumer side.

Also, you keep failing (IMO) to explain _who_ is going to be the consumer of the file. What kind of analysis will need such optimized data collection, and what can you do about that? This is really _essential_ when adding a new interface to provide data that is already available by other means. In other words, tell us your specific use case that is hitting a bottleneck that cannot be handled by the existing API, and we can start considering a new one.