Message ID | 20240704030704.2289667-2-lihongbo22@huawei.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Introduce tracepoint for hugetlbfs | expand |
On Thu, Jul 04, 2024 at 11:07:03AM +0800, Hongbo Li wrote: > + TP_printk("dev = (%d,%d), ino = %lu, dir = %lu, mode = 0%o", > + MAJOR(__entry->dev), MINOR(__entry->dev), > + (unsigned long) __entry->ino, > + (unsigned long) __entry->dir, __entry->mode) erofs and f2fs are the only two places that print devices like this. "dev=%d:%d inode=%lx" Why do we need dir and mode? Actually, why do we need a tracepoint on alloc_inode at all? What does it help us debug, and why does no other filesystem need an alloc_inode tracepoint?
On 2024/7/4 11:37, Matthew Wilcox wrote: > On Thu, Jul 04, 2024 at 11:07:03AM +0800, Hongbo Li wrote: >> + TP_printk("dev = (%d,%d), ino = %lu, dir = %lu, mode = 0%o", >> + MAJOR(__entry->dev), MINOR(__entry->dev), >> + (unsigned long) __entry->ino, >> + (unsigned long) __entry->dir, __entry->mode) > > erofs and f2fs are the only two places that print devices like this. > > "dev=%d:%d inode=%lx" > > Why do we need dir and mode? Thanks for reviewing! Here dir and mode are used to track the creation of the directory tree. > > Actually, why do we need a tracepoint on alloc_inode at all? What > does it help us debug, and why does no other filesystem need an > alloc_inode tracepoint? > In fact, f2fs and ext4 have added this tracepoint such as trace_f2fs_new_inode(in f2fs) and trace_ext4_allocate_inode(in ext4). This can trace the lifecycle of an inode comprehensively. These tracepoints are used to debug some closed application scenarios, and also helping developers to debug the filesystem logic in hugetlbfs. Thanks, Hongbo
On Thu, Jul 04, 2024 at 04:37:30AM +0100, Matthew Wilcox wrote: > On Thu, Jul 04, 2024 at 11:07:03AM +0800, Hongbo Li wrote: > > + TP_printk("dev = (%d,%d), ino = %lu, dir = %lu, mode = 0%o", > > + MAJOR(__entry->dev), MINOR(__entry->dev), > > + (unsigned long) __entry->ino, > > + (unsigned long) __entry->dir, __entry->mode) > > erofs and f2fs are the only two places that print devices like this. > > "dev=%d:%d inode=%lx" "dev %d:%d inode %lx" i.e. every token in the output should be space separated, and no commas between values. Any other format makes it difficult for post processing tracepoint output with sed, grep, awk, python, etc. Every token then has to be split into name and value parts, and then the value has to have the comma stripped from it. Having to do this is additional work when writing use-once scripts that get thrown away when the tracepoint output analysis is done is painful, and it's completely unnecessary if the tracepoint output is completely space separated from the start. -Dave.
On Thu, 4 Jul 2024 22:56:29 +1000 Dave Chinner <david@fromorbit.com> wrote: > Having to do this is additional work when writing use-once scripts > that get thrown away when the tracepoint output analysis is done > is painful, and it's completely unnecessary if the tracepoint output > is completely space separated from the start. If you are using scripts to parse the output, then you could just enable the "fields" options, which will just ignore the TP_printk() and print the fields in both their hex and decimal values: # trace-cmd start -e filemap -O fields // the above fields change can also be done with: // echo 1 > /sys/kernel/tracing/options/fields # trace-cmd show # tracer: nop # # entries-in-buffer/entries-written: 8/8 #P:8 # # _-----=> irqs-off/BH-disabled # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / _-=> migrate-disable # |||| / delay # TASK-PID CPU# ||||| TIMESTAMP FUNCTION # | | | ||||| | | less-2527 [004] ..... 61949.896458: mm_filemap_add_to_page_cache: pfn=0x144625 (1328677) i_ino=0x335c6 (210374) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0) less-2527 [004] d..2. 61949.896926: mm_filemap_delete_from_page_cache: pfn=0x152b07 (1387271) i_ino=0x2d73a (186170) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0) jbd2/vda3-8-268 [005] ..... 61954.461964: mm_filemap_add_to_page_cache: pfn=0x152b70 (1387376) i_ino=0xfe00003 (266338307) index=0x30bd33 (3194163) s_dev=0x3 (3) order=(0) jbd2/vda3-8-268 [005] ..... 61954.462669: mm_filemap_add_to_page_cache: pfn=0x15335b (1389403) i_ino=0xfe00003 (266338307) index=0x30bd40 (3194176) s_dev=0x3 (3) order=(0) jbd2/vda3-8-268 [005] ..... 62001.565391: mm_filemap_add_to_page_cache: pfn=0x13a996 (1288598) i_ino=0xfe00003 (266338307) index=0x30bd41 (3194177) s_dev=0x3 (3) order=(0) jbd2/vda3-8-268 [005] ..... 62001.566081: mm_filemap_add_to_page_cache: pfn=0x1446b5 (1328821) i_ino=0xfe00003 (266338307) index=0x30bd43 (3194179) s_dev=0x3 (3) order=(0) less-2530 [004] ..... 62033.182309: mm_filemap_add_to_page_cache: pfn=0x13d755 (1300309) i_ino=0x2d73a (186170) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0) less-2530 [004] d..2. 62033.182801: mm_filemap_delete_from_page_cache: pfn=0x144625 (1328677) i_ino=0x335c6 (210374) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0) Just an FYI, -- Steve
On Thu, Jul 04, 2024 at 10:13:22AM -0400, Steven Rostedt wrote: > On Thu, 4 Jul 2024 22:56:29 +1000 > Dave Chinner <david@fromorbit.com> wrote: > > > Having to do this is additional work when writing use-once scripts > > that get thrown away when the tracepoint output analysis is done > > is painful, and it's completely unnecessary if the tracepoint output > > is completely space separated from the start. > > If you are using scripts to parse the output, then you could just > enable the "fields" options, which will just ignore the TP_printk() and > print the fields in both their hex and decimal values: > > # trace-cmd start -e filemap -O fields > > // the above fields change can also be done with: > // echo 1 > /sys/kernel/tracing/options/fields > > # trace-cmd show > # tracer: nop > # > # entries-in-buffer/entries-written: 8/8 #P:8 > # > # _-----=> irqs-off/BH-disabled > # / _----=> need-resched > # | / _---=> hardirq/softirq > # || / _--=> preempt-depth > # ||| / _-=> migrate-disable > # |||| / delay > # TASK-PID CPU# ||||| TIMESTAMP FUNCTION > # | | | ||||| | | > less-2527 [004] ..... 61949.896458: mm_filemap_add_to_page_cache: pfn=0x144625 (1328677) i_ino=0x335c6 (210374) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0) > less-2527 [004] d..2. 61949.896926: mm_filemap_delete_from_page_cache: pfn=0x152b07 (1387271) i_ino=0x2d73a (186170) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0) > jbd2/vda3-8-268 [005] ..... 61954.461964: mm_filemap_add_to_page_cache: pfn=0x152b70 (1387376) i_ino=0xfe00003 (266338307) index=0x30bd33 (3194163) s_dev=0x3 (3) order=(0) > jbd2/vda3-8-268 [005] ..... 61954.462669: mm_filemap_add_to_page_cache: pfn=0x15335b (1389403) i_ino=0xfe00003 (266338307) index=0x30bd40 (3194176) s_dev=0x3 (3) order=(0) > jbd2/vda3-8-268 [005] ..... 62001.565391: mm_filemap_add_to_page_cache: pfn=0x13a996 (1288598) i_ino=0xfe00003 (266338307) index=0x30bd41 (3194177) s_dev=0x3 (3) order=(0) > jbd2/vda3-8-268 [005] ..... 62001.566081: mm_filemap_add_to_page_cache: pfn=0x1446b5 (1328821) i_ino=0xfe00003 (266338307) index=0x30bd43 (3194179) s_dev=0x3 (3) order=(0) > less-2530 [004] ..... 62033.182309: mm_filemap_add_to_page_cache: pfn=0x13d755 (1300309) i_ino=0x2d73a (186170) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0) > less-2530 [004] d..2. 62033.182801: mm_filemap_delete_from_page_cache: pfn=0x144625 (1328677) i_ino=0x335c6 (210374) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0) Yes, I know about that. But this just makes things harder, because now there are *3* different formats that have to be handled (i.e. now we also have to strip "()" around numbers). -Dave.
On 2024/7/5 7:34, Dave Chinner wrote: > On Thu, Jul 04, 2024 at 10:13:22AM -0400, Steven Rostedt wrote: >> On Thu, 4 Jul 2024 22:56:29 +1000 >> Dave Chinner <david@fromorbit.com> wrote: >> >>> Having to do this is additional work when writing use-once scripts >>> that get thrown away when the tracepoint output analysis is done >>> is painful, and it's completely unnecessary if the tracepoint output >>> is completely space separated from the start. >> >> If you are using scripts to parse the output, then you could just >> enable the "fields" options, which will just ignore the TP_printk() and >> print the fields in both their hex and decimal values: >> >> # trace-cmd start -e filemap -O fields >> >> // the above fields change can also be done with: >> // echo 1 > /sys/kernel/tracing/options/fields >> >> # trace-cmd show >> # tracer: nop >> # >> # entries-in-buffer/entries-written: 8/8 #P:8 >> # >> # _-----=> irqs-off/BH-disabled >> # / _----=> need-resched >> # | / _---=> hardirq/softirq >> # || / _--=> preempt-depth >> # ||| / _-=> migrate-disable >> # |||| / delay >> # TASK-PID CPU# ||||| TIMESTAMP FUNCTION >> # | | | ||||| | | >> less-2527 [004] ..... 61949.896458: mm_filemap_add_to_page_cache: pfn=0x144625 (1328677) i_ino=0x335c6 (210374) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0) >> less-2527 [004] d..2. 61949.896926: mm_filemap_delete_from_page_cache: pfn=0x152b07 (1387271) i_ino=0x2d73a (186170) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0) >> jbd2/vda3-8-268 [005] ..... 61954.461964: mm_filemap_add_to_page_cache: pfn=0x152b70 (1387376) i_ino=0xfe00003 (266338307) index=0x30bd33 (3194163) s_dev=0x3 (3) order=(0) >> jbd2/vda3-8-268 [005] ..... 61954.462669: mm_filemap_add_to_page_cache: pfn=0x15335b (1389403) i_ino=0xfe00003 (266338307) index=0x30bd40 (3194176) s_dev=0x3 (3) order=(0) >> jbd2/vda3-8-268 [005] ..... 62001.565391: mm_filemap_add_to_page_cache: pfn=0x13a996 (1288598) i_ino=0xfe00003 (266338307) index=0x30bd41 (3194177) s_dev=0x3 (3) order=(0) >> jbd2/vda3-8-268 [005] ..... 62001.566081: mm_filemap_add_to_page_cache: pfn=0x1446b5 (1328821) i_ino=0xfe00003 (266338307) index=0x30bd43 (3194179) s_dev=0x3 (3) order=(0) >> less-2530 [004] ..... 62033.182309: mm_filemap_add_to_page_cache: pfn=0x13d755 (1300309) i_ino=0x2d73a (186170) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0) >> less-2530 [004] d..2. 62033.182801: mm_filemap_delete_from_page_cache: pfn=0x144625 (1328677) i_ino=0x335c6 (210374) index=0x0 (0) s_dev=0xfe00003 (266338307) order=(0) > > Yes, I know about that. But this just makes things harder, because > now there are *3* different formats that have to be handled (i.e. > now we also have to strip "()" around numbers). Perhaps if users want to filter the format, they could enable the "fields" option in a unified manner. As for TP_printk(), it depends on how to better display the data used for debugging. Thanks, Hongbo > > -Dave.
diff --git a/MAINTAINERS b/MAINTAINERS index cd2ca0c3158e..865c48e92d40 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10188,6 +10188,7 @@ F: Documentation/mm/hugetlbfs_reserv.rst F: Documentation/mm/vmemmap_dedup.rst F: fs/hugetlbfs/ F: include/linux/hugetlb.h +F: include/trace/events/hugetlbfs.h F: mm/hugetlb.c F: mm/hugetlb_vmemmap.c F: mm/hugetlb_vmemmap.h diff --git a/include/trace/events/hugetlbfs.h b/include/trace/events/hugetlbfs.h new file mode 100644 index 000000000000..975f584f6f51 --- /dev/null +++ b/include/trace/events/hugetlbfs.h @@ -0,0 +1,160 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM hugetlbfs + +#if !defined(_TRACE_HUGETLBFS_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_HUGETLBFS_H + +#include <linux/tracepoint.h> + +TRACE_EVENT(hugetlbfs_alloc_inode, + + TP_PROTO(struct inode *inode, struct inode *dir, int mode), + + TP_ARGS(inode, dir, mode), + + TP_STRUCT__entry( + __field(dev_t, dev) + __field(ino_t, ino) + __field(ino_t, dir) + __field(__u16, mode) + ), + + TP_fast_assign( + __entry->dev = inode->i_sb->s_dev; + __entry->ino = inode->i_ino; + __entry->dir = dir->i_ino; + __entry->mode = mode; + ), + + TP_printk("dev = (%d,%d), ino = %lu, dir = %lu, mode = 0%o", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long) __entry->ino, + (unsigned long) __entry->dir, __entry->mode) +); + +DECLARE_EVENT_CLASS(hugetlbfs__inode, + + TP_PROTO(struct inode *inode), + + TP_ARGS(inode), + + TP_STRUCT__entry( + __field(dev_t, dev) + __field(ino_t, ino) + __field(__u16, mode) + __field(loff_t, size) + __field(unsigned int, nlink) + __field(unsigned int, seals) + __field(blkcnt_t, blocks) + ), + + TP_fast_assign( + __entry->dev = inode->i_sb->s_dev; + __entry->ino = inode->i_ino; + __entry->mode = inode->i_mode; + __entry->size = inode->i_size; + __entry->nlink = inode->i_nlink; + __entry->seals = HUGETLBFS_I(inode)->seals; + __entry->blocks = inode->i_blocks; + ), + + TP_printk("dev = (%d,%d), ino = %lu, i_mode = 0%o, i_size = %lld, i_nlink = %u, seals = %u, i_blocks = %llu", + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long) __entry->ino, + __entry->mode, __entry->size, __entry->nlink, __entry->seals, + (unsigned long long)__entry->blocks) +); + +DEFINE_EVENT(hugetlbfs__inode, hugetlbfs_evict_inode, + + TP_PROTO(struct inode *inode), + + TP_ARGS(inode) +); + +DEFINE_EVENT(hugetlbfs__inode, hugetlbfs_free_inode, + + TP_PROTO(struct inode *inode), + + TP_ARGS(inode) +); + +TRACE_EVENT(hugetlbfs_setattr, + + TP_PROTO(struct inode *inode, struct dentry *dentry, + struct iattr *attr), + + TP_ARGS(inode, dentry, attr), + + TP_STRUCT__entry( + __field(dev_t, dev) + __field(ino_t, ino) + __field(unsigned int, d_len) + __string(d_name, dentry->d_name.name) + __field(unsigned int, ia_valid) + __field(unsigned int, ia_mode) + __field(unsigned int, ia_uid) + __field(unsigned int, ia_gid) + __field(loff_t, old_size) + __field(loff_t, ia_size) + ), + + TP_fast_assign( + __entry->dev = inode->i_sb->s_dev; + __entry->ino = inode->i_ino; + __entry->d_len = dentry->d_name.len; + __assign_str(d_name); + __entry->ia_valid = attr->ia_valid; + __entry->ia_mode = attr->ia_mode; + __entry->ia_uid = from_kuid_munged(current_user_ns(), attr->ia_uid); + __entry->ia_gid = from_kgid_munged(current_user_ns(), attr->ia_gid); + __entry->old_size = inode->i_size; + __entry->ia_size = attr->ia_size; + ), + + TP_printk("dev = (%d,%d), ino = %lu, name = %.*s, ia_valid = %u, ia_mode = 0%o, ia_uid = %u, ia_gid = %u, old_size = %lld, ia_size = %lld", + MAJOR(__entry->dev), MINOR(__entry->dev), (unsigned long)__entry->ino, + __entry->d_len, __get_str(d_name), __entry->ia_valid, __entry->ia_mode, + __entry->ia_uid, __entry->ia_gid, __entry->old_size, __entry->ia_size) +); + +TRACE_EVENT(hugetlbfs_fallocate, + + TP_PROTO(struct inode *inode, int mode, + loff_t offset, loff_t len, int ret), + + TP_ARGS(inode, mode, offset, len, ret), + + TP_STRUCT__entry( + __field(dev_t, dev) + __field(ino_t, ino) + __field(int, mode) + __field(loff_t, offset) + __field(loff_t, len) + __field(loff_t, size) + __field(int, ret) + ), + + TP_fast_assign( + __entry->dev = inode->i_sb->s_dev; + __entry->ino = inode->i_ino; + __entry->mode = mode; + __entry->offset = offset; + __entry->len = len; + __entry->size = inode->i_size; + __entry->ret = ret; + ), + + TP_printk("dev = (%d,%d), ino = %lu, mode = %x, offset = %lld, len = %lld, i_size = %lld, ret = %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + (unsigned long)__entry->ino, __entry->mode, + (unsigned long long)__entry->offset, + (unsigned long long)__entry->len, + (unsigned long long)__entry->size, + __entry->ret) +); + +#endif /* _TRACE_HUGETLBFS_H */ + + /* This part must be outside protection */ +#include <trace/define_trace.h>
Add basic tracepoints for {alloc, evict, free}_inode, setattr and fallocate. These can help users to debug hugetlbfs more conveniently. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> --- MAINTAINERS | 1 + include/trace/events/hugetlbfs.h | 160 +++++++++++++++++++++++++++++++ 2 files changed, 161 insertions(+) create mode 100644 include/trace/events/hugetlbfs.h