Message ID | 43245b463f00506016b8c39c0252faf62bd73e35.1708709155.git.john@groves.net
---|---
State | Superseded, archived
Series | Introduce the famfs shared-memory file system
On 2/23/24 09:42, John Groves wrote:
> One of the key requirements for famfs is that it service vma faults
> efficiently. Our metadata helps - the search order is n for n extents,
> and n is usually 1. But we can still observe gnarly lock contention
> in mm if PTE faults are happening. This commit introduces fault counters
> that can be enabled and read via /sys/fs/famfs/...
>
> These counters have proved useful in troubleshooting situations where
> PTE faults were happening instead of PMD. No performance impact when
> disabled.

This seems kinda wonky. Why does _this_ specific filesystem need its
own fault counters? Seems like something we'd want to do much more
generically, if it is needed at all.

Was the issue here just that vm_ops->fault() was getting called instead
of ->huge_fault()? Or something more subtle?
On 24/02/23 10:23AM, Dave Hansen wrote:
> On 2/23/24 09:42, John Groves wrote:
> > One of the key requirements for famfs is that it service vma faults
> > efficiently. Our metadata helps - the search order is n for n extents,
> > and n is usually 1. But we can still observe gnarly lock contention
> > in mm if PTE faults are happening. This commit introduces fault counters
> > that can be enabled and read via /sys/fs/famfs/...
> >
> > These counters have proved useful in troubleshooting situations where
> > PTE faults were happening instead of PMD. No performance impact when
> > disabled.
>
> This seems kinda wonky. Why does _this_ specific filesystem need its
> own fault counters? Seems like something we'd want to do much more
> generically, if it is needed at all.
>
> Was the issue here just that vm_ops->fault() was getting called instead
> of ->huge_fault()? Or something more subtle?

Thanks for your reply Dave!

First, I'm willing to pull the fault counters out if the brain trust doesn't
like them.

I put them in because we were running benchmarks of computational data
analytics and noted that jobs took 3x as long on famfs as raw dax -
which indicated I was doing something wrong, because it should be equivalent
or very close.

The solution was to call thp_get_unmapped_area() in
famfs_file_operations, and performance doesn't vary significantly from raw
dax now. Prior to that I wasn't making sure the mmap address was PMD aligned.

After that I wanted a way to be double-secret-certain that it was servicing
PMD faults as intended. Which it basically always is, so far. (The smoke
tests in user space check this.)

John
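For context, the fix described above boils down to one line in the file
operations: pointing ->get_unmapped_area at the generic THP helper so that
mmap() hands back PMD-aligned addresses. Without that alignment, DAX huge
faults fall back to PTE faults, which is consistent with the slowdown
described above. A minimal sketch of the wiring follows; only
thp_get_unmapped_area() and the famfs_file_operations name come from the
thread, the rest is an assumption about what such a file_operations might
contain, not the actual famfs source:

/*
 * Sketch only (not the actual famfs code). The key line is
 * .get_unmapped_area = thp_get_unmapped_area, which makes mmap() return
 * PMD-aligned addresses so DAX can service huge faults instead of
 * falling back to PTE faults.
 */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/huge_mm.h>	/* thp_get_unmapped_area() */

static const struct file_operations famfs_file_operations = {
	.owner			= THIS_MODULE,
	.llseek			= generic_file_llseek,
	.get_unmapped_area	= thp_get_unmapped_area, /* PMD-aligned mmap addresses */
	/* .mmap, .read_iter, .write_iter, etc. omitted from this sketch */
};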
John Groves wrote:
> On 24/02/23 10:23AM, Dave Hansen wrote:
> > On 2/23/24 09:42, John Groves wrote:
> > > One of the key requirements for famfs is that it service vma faults
> > > efficiently. Our metadata helps - the search order is n for n extents,
> > > and n is usually 1. But we can still observe gnarly lock contention
> > > in mm if PTE faults are happening. This commit introduces fault counters
> > > that can be enabled and read via /sys/fs/famfs/...
> > >
> > > These counters have proved useful in troubleshooting situations where
> > > PTE faults were happening instead of PMD. No performance impact when
> > > disabled.
> >
> > This seems kinda wonky. Why does _this_ specific filesystem need its
> > own fault counters? Seems like something we'd want to do much more
> > generically, if it is needed at all.
> >
> > Was the issue here just that vm_ops->fault() was getting called instead
> > of ->huge_fault()? Or something more subtle?
>
> Thanks for your reply Dave!
>
> First, I'm willing to pull the fault counters out if the brain trust doesn't
> like them.
>
> I put them in because we were running benchmarks of computational data
> analytics and noted that jobs took 3x as long on famfs as raw dax -
> which indicated I was doing something wrong, because it should be equivalent
> or very close.
>
> The solution was to call thp_get_unmapped_area() in
> famfs_file_operations, and performance doesn't vary significantly from raw
> dax now. Prior to that I wasn't making sure the mmap address was PMD aligned.
>
> After that I wanted a way to be double-secret-certain that it was servicing
> PMD faults as intended. Which it basically always is, so far. (The smoke
> tests in user space check this.)

We had similar unit test regression concerns with fsdax where some
upstream change silently broke PMD faults. The solution there was trace
points in the fault handlers and a basic test that knows a priori that it
*should* be triggering a certain number of huge faults:

https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31
On 24/02/23 12:04PM, Dan Williams wrote:
> John Groves wrote:
> > On 24/02/23 10:23AM, Dave Hansen wrote:
> > > On 2/23/24 09:42, John Groves wrote:
> > > > One of the key requirements for famfs is that it service vma faults
> > > > efficiently. Our metadata helps - the search order is n for n extents,
> > > > and n is usually 1. But we can still observe gnarly lock contention
> > > > in mm if PTE faults are happening. This commit introduces fault counters
> > > > that can be enabled and read via /sys/fs/famfs/...
> > > >
> > > > These counters have proved useful in troubleshooting situations where
> > > > PTE faults were happening instead of PMD. No performance impact when
> > > > disabled.
> > >
> > > This seems kinda wonky. Why does _this_ specific filesystem need its
> > > own fault counters? Seems like something we'd want to do much more
> > > generically, if it is needed at all.
> > >
> > > Was the issue here just that vm_ops->fault() was getting called instead
> > > of ->huge_fault()? Or something more subtle?
> >
> > Thanks for your reply Dave!
> >
> > First, I'm willing to pull the fault counters out if the brain trust doesn't
> > like them.
> >
> > I put them in because we were running benchmarks of computational data
> > analytics and noted that jobs took 3x as long on famfs as raw dax -
> > which indicated I was doing something wrong, because it should be equivalent
> > or very close.
> >
> > The solution was to call thp_get_unmapped_area() in
> > famfs_file_operations, and performance doesn't vary significantly from raw
> > dax now. Prior to that I wasn't making sure the mmap address was PMD aligned.
> >
> > After that I wanted a way to be double-secret-certain that it was servicing
> > PMD faults as intended. Which it basically always is, so far. (The smoke
> > tests in user space check this.)
>
> We had similar unit test regression concerns with fsdax where some
> upstream change silently broke PMD faults. The solution there was trace
> points in the fault handlers and a basic test that knows a priori that it
> *should* be triggering a certain number of huge faults:
>
> https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31

Good approach, thanks Dan! My working assumption is that we'll be able to make
that approach work in the famfs tests. So the fault counters should go away
in the next version.

John
On 2/23/24 12:39, John Groves wrote:
>> We had similar unit test regression concerns with fsdax where some
>> upstream change silently broke PMD faults. The solution there was trace
>> points in the fault handlers and a basic test that knows a priori that it
>> *should* be triggering a certain number of huge faults:
>>
>> https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31
> Good approach, thanks Dan! My working assumption is that we'll be able to make
> that approach work in the famfs tests. So the fault counters should go away
> in the next version.

I do really suspect there's something more generic that should be done
here. Maybe we need a generic 'huge_faults' perf event to pair up with
the good ol' faults that we already have:

  # perf stat -e faults /bin/ls

   Performance counter stats for '/bin/ls':

               104      faults


       0.001499862 seconds time elapsed

       0.001490000 seconds user
       0.000000000 seconds sys
Dave Hansen wrote:
> On 2/23/24 12:39, John Groves wrote:
> >> We had similar unit test regression concerns with fsdax where some
> >> upstream change silently broke PMD faults. The solution there was trace
> >> points in the fault handlers and a basic test that knows a priori that it
> >> *should* be triggering a certain number of huge faults:
> >>
> >> https://github.com/pmem/ndctl/blob/main/test/dax.sh#L31
> > Good approach, thanks Dan! My working assumption is that we'll be able to make
> > that approach work in the famfs tests. So the fault counters should go away
> > in the next version.
>
> I do really suspect there's something more generic that should be done
> here. Maybe we need a generic 'huge_faults' perf event to pair up with
> the good ol' faults that we already have:
>
>   # perf stat -e faults /bin/ls
>
>    Performance counter stats for '/bin/ls':
>
>                104      faults
>
>
>        0.001499862 seconds time elapsed
>
>        0.001490000 seconds user
>        0.000000000 seconds sys

Certainly something like that would have satisfied this sanity test use
case. I will note that mm_account_fault() would need some help to figure
out the size of the page table entry that got installed. Maybe
extensions to vm_fault_reason to add VM_FAULT_P*D? That complements
VM_FAULT_FALLBACK to indicate whether, for example, the fallback went
from PUD to PMD, or all the way back to PTE.

Then use cases like this could just add a dynamic probe in
mm_account_fault(). No real need for a new tracepoint unless there was a
use case for this outside of regression testing fault handlers, right?
On Fri, Feb 23, 2024 at 03:50:33PM -0800, Dan Williams wrote:
> Certainly something like that would have satisfied this sanity test use
> case. I will note that mm_account_fault() would need some help to figure
> out the size of the page table entry that got installed. Maybe
> extensions to vm_fault_reason to add VM_FAULT_P*D? That complements
> VM_FAULT_FALLBACK to indicate whether, for example, the fallback went
> from PUD to PMD, or all the way back to PTE.

ugh, no, it's more complicated than that. look at the recent changes to
set_ptes(). we can now install PTEs of many different sizes, depending
on the architecture. someday i look forward to supporting all the page
sizes on parisc (4k, 16k, 64k, 256k, ... 4G)
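For readers who have not followed the set_ptes() work: the generic helper
takes a count, so a single fault can install anywhere from one PTE up to a
large batch, which is why a fixed PTE/PMD/PUD taxonomy undercounts the
possibilities. A rough sketch follows, paraphrasing the interface from
recent include/linux/pgtable.h (check the current tree for the
authoritative definition); the helper below is hypothetical and only
illustrates the arithmetic:

/*
 * Paraphrased for illustration; see include/linux/pgtable.h in a recent
 * kernel for the real definition:
 *
 *   void set_ptes(struct mm_struct *mm, unsigned long addr,
 *                 pte_t *ptep, pte_t pte, unsigned int nr);
 *
 * 'nr' consecutive PTEs are installed in one call, so the mapped span is
 * nr * PAGE_SIZE and need not land exactly on a PTE, PMD, or PUD size.
 */
#include <linux/mm.h>

/* Hypothetical helper, not kernel code: bytes covered by one set_ptes() call */
static inline unsigned long set_ptes_span(unsigned int nr)
{
	return (unsigned long)nr << PAGE_SHIFT;
}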
Matthew Wilcox wrote:
> On Fri, Feb 23, 2024 at 03:50:33PM -0800, Dan Williams wrote:
> > Certainly something like that would have satisfied this sanity test use
> > case. I will note that mm_account_fault() would need some help to figure
> > out the size of the page table entry that got installed. Maybe
> > extensions to vm_fault_reason to add VM_FAULT_P*D? That complements
> > VM_FAULT_FALLBACK to indicate whether, for example, the fallback went
> > from PUD to PMD, or all the way back to PTE.
>
> ugh, no, it's more complicated than that. look at the recent changes to
> set_ptes(). we can now install PTEs of many different sizes, depending
> on the architecture. someday i look forward to supporting all the page
> sizes on parisc (4k, 16k, 64k, 256k, ... 4G)

Nice! There are enough bits in vm_fault_t to represent many page sizes
instead of the entry type as I suggested, but I would defer to you or
Dave on how to make "installed pte size" generically traceable per
Dave's suggestion.
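To make the shape of that idea a little more concrete, a purely
hypothetical sketch of "record the installed order in spare vm_fault_t
bits" follows. None of these names or bit positions exist upstream, the
free-bit assumption is unverified, and the __bitwise/sparse annotations
are omitted; it only illustrates what fault handlers and
mm_account_fault() might exchange:

/*
 * HYPOTHETICAL sketch -- none of these flags or helpers exist in the
 * kernel, and the bit positions are an unverified assumption. A fault
 * handler would OR in the order of whatever it installed (PTE, PMD, PUD,
 * or an in-between size from a batched set_ptes()), and
 * mm_account_fault() could decode it for a tracepoint or perf event.
 */
#include <linux/mm_types.h>

#define VM_FAULT_ORDER_SHIFT	24	/* assumed-unused high bits of vm_fault_t */
#define VM_FAULT_ORDER_MASK	(0x3fu << VM_FAULT_ORDER_SHIFT)

static inline vm_fault_t vmf_set_order(vm_fault_t ret, unsigned int order)
{
	return ret | (vm_fault_t)(order << VM_FAULT_ORDER_SHIFT);
}

static inline unsigned int vmf_order(vm_fault_t ret)
{
	return ((unsigned int)ret & VM_FAULT_ORDER_MASK) >> VM_FAULT_ORDER_SHIFT;
}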
diff --git a/fs/famfs/famfs_file.c b/fs/famfs/famfs_file.c
index fd42d5966982..a626f8a89790 100644
--- a/fs/famfs/famfs_file.c
+++ b/fs/famfs/famfs_file.c
@@ -19,6 +19,100 @@
 #include <uapi/linux/famfs_ioctl.h>
 #include "famfs_internal.h"
 
+/***************************************************************************************
+ * filemap_fault counters
+ *
+ * The counters and the fault_count_enable file live at
+ * /sys/fs/famfs/
+ */
+struct famfs_fault_counters ffc;
+static int fault_count_enable;
+
+static ssize_t
+fault_count_enable_show(struct kobject *kobj,
+			struct kobj_attribute *attr,
+			char *buf)
+{
+	return sprintf(buf, "%d\n", fault_count_enable);
+}
+
+static ssize_t
+fault_count_enable_store(struct kobject *kobj,
+			 struct kobj_attribute *attr,
+			 const char *buf,
+			 size_t count)
+{
+	int value;
+	int rc;
+
+	rc = sscanf(buf, "%d", &value);
+	if (rc != 1)
+		return 0;
+
+	if (value > 0) /* clear fault counters when enabling, but not when disabling */
+		famfs_clear_fault_counters(&ffc);
+
+	fault_count_enable = value;
+	return count;
+}
+
+/* Individual fault counters are read-only */
+static ssize_t
+fault_count_pte_show(struct kobject *kobj,
+		     struct kobj_attribute *attr,
+		     char *buf)
+{
+	return sprintf(buf, "%llu", famfs_pte_fault_ct(&ffc));
+}
+
+static ssize_t
+fault_count_pmd_show(struct kobject *kobj,
+		     struct kobj_attribute *attr,
+		     char *buf)
+{
+	return sprintf(buf, "%llu", famfs_pmd_fault_ct(&ffc));
+}
+
+static ssize_t
+fault_count_pud_show(struct kobject *kobj,
+		     struct kobj_attribute *attr,
+		     char *buf)
+{
+	return sprintf(buf, "%llu", famfs_pud_fault_ct(&ffc));
+}
+
+static struct kobj_attribute fault_count_enable_attribute = __ATTR(fault_count_enable,
+								    0660,
+								    fault_count_enable_show,
+								    fault_count_enable_store);
+static struct kobj_attribute fault_count_pte_attribute = __ATTR(pte_fault_ct,
+								0440,
+								fault_count_pte_show,
+								NULL);
+static struct kobj_attribute fault_count_pmd_attribute = __ATTR(pmd_fault_ct,
+								0440,
+								fault_count_pmd_show,
+								NULL);
+static struct kobj_attribute fault_count_pud_attribute = __ATTR(pud_fault_ct,
+								0440,
+								fault_count_pud_show,
+								NULL);
+
+
+static struct attribute *attrs[] = {
+	&fault_count_enable_attribute.attr,
+	&fault_count_pte_attribute.attr,
+	&fault_count_pmd_attribute.attr,
+	&fault_count_pud_attribute.attr,
+	NULL,
+};
+
+struct attribute_group famfs_attr_group = {
+	.attrs = attrs,
+};
+
+/* End fault counters */
+
 /**
  * famfs_map_meta_alloc() - Allocate famfs file metadata
  * @mapp:	Pointer to an mcache_map_meta pointer
@@ -525,6 +619,9 @@ __famfs_filemap_fault(
 	if (IS_DAX(inode)) {
 		pfn_t pfn;
 
+		if (fault_count_enable)
+			famfs_inc_fault_counter_by_order(&ffc, pe_size);
+
 		ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, &famfs_iomap_ops);
 		if (ret & VM_FAULT_NEEDDSYNC)
 			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
diff --git a/fs/famfs/famfs_internal.h b/fs/famfs/famfs_internal.h
index af3990d43305..987cb172a149 100644
--- a/fs/famfs/famfs_internal.h
+++ b/fs/famfs/famfs_internal.h
@@ -50,4 +50,77 @@ struct famfs_fs_info {
 	char	*rootdev;
 };
 
+/*
+ * filemap_fault counters
+ */
+extern struct attribute_group famfs_attr_group;
+
+enum famfs_fault {
+	FAMFS_PTE = 0,
+	FAMFS_PMD,
+	FAMFS_PUD,
+	FAMFS_NUM_FAULT_TYPES,
+};
+
+static inline int valid_fault_type(int type)
+{
+	if (unlikely(type < 0 || type > FAMFS_PUD))
+		return 0;
+	return 1;
+}
+
+struct famfs_fault_counters {
+	atomic64_t fault_ct[FAMFS_NUM_FAULT_TYPES];
+};
+
+extern struct famfs_fault_counters ffc;
+
+static inline void famfs_clear_fault_counters(struct famfs_fault_counters *fc)
+{
+	int i;
+
+	for (i = 0; i < FAMFS_NUM_FAULT_TYPES; i++)
+		atomic64_set(&fc->fault_ct[i], 0);
+}
+
+static inline void famfs_inc_fault_counter(struct famfs_fault_counters *fc,
+					   enum famfs_fault type)
+{
+	if (valid_fault_type(type))
+		atomic64_inc(&fc->fault_ct[type]);
+}
+
+static inline void famfs_inc_fault_counter_by_order(struct famfs_fault_counters *fc, int order)
+{
+	int pgf = -1;
+
+	switch (order) {
+	case 0:
+		pgf = FAMFS_PTE;
+		break;
+	case PMD_ORDER:
+		pgf = FAMFS_PMD;
+		break;
+	case PUD_ORDER:
+		pgf = FAMFS_PUD;
+		break;
+	}
+	famfs_inc_fault_counter(fc, pgf);
+}
+
+static inline u64 famfs_pte_fault_ct(struct famfs_fault_counters *fc)
+{
+	return atomic64_read(&fc->fault_ct[FAMFS_PTE]);
+}
+
+static inline u64 famfs_pmd_fault_ct(struct famfs_fault_counters *fc)
+{
+	return atomic64_read(&fc->fault_ct[FAMFS_PMD]);
+}
+
+static inline u64 famfs_pud_fault_ct(struct famfs_fault_counters *fc)
+{
+	return atomic64_read(&fc->fault_ct[FAMFS_PUD]);
+}
+
 #endif /* FAMFS_INTERNAL_H */
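The diff defines famfs_attr_group but the registration that puts it under
/sys/fs/famfs/ is not part of this patch. For orientation, a typical
registration sequence looks roughly like the sketch below; this is an
assumption about how famfs wires it up, not code from the series:

/*
 * Sketch only -- NOT part of the patch above. It shows the usual way an
 * attribute_group ends up under /sys/fs/<name>/.
 */
#include <linux/errno.h>
#include <linux/fs.h>		/* fs_kobj, the /sys/fs kobject */
#include <linux/init.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

extern struct attribute_group famfs_attr_group;	/* declared in famfs_internal.h */

static struct kobject *famfs_kobj;

static int __init famfs_sysfs_init(void)
{
	int rc;

	/* creates the /sys/fs/famfs directory */
	famfs_kobj = kobject_create_and_add("famfs", fs_kobj);
	if (!famfs_kobj)
		return -ENOMEM;

	/* populates fault_count_enable, pte_fault_ct, pmd_fault_ct, pud_fault_ct */
	rc = sysfs_create_group(famfs_kobj, &famfs_attr_group);
	if (rc)
		kobject_put(famfs_kobj);
	return rc;
}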
One of the key requirements for famfs is that it service vma faults
efficiently. Our metadata helps - the search order is n for n extents,
and n is usually 1. But we can still observe gnarly lock contention
in mm if PTE faults are happening. This commit introduces fault counters
that can be enabled and read via /sys/fs/famfs/...

These counters have proved useful in troubleshooting situations where
PTE faults were happening instead of PMD. No performance impact when
disabled.

Signed-off-by: John Groves <john@groves.net>
---
 fs/famfs/famfs_file.c     | 97 +++++++++++++++++++++++++++++++++++++++
 fs/famfs/famfs_internal.h | 73 +++++++++++++++++++++++++++++
 2 files changed, 170 insertions(+)
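As a usage note, here is a small user-space sketch for exercising the
interface added above: enable the counters, run an mmap workload against a
famfs file, then read the per-size counts. The attribute names come from
the patch; the exact /sys/fs/famfs/ paths are an assumption based on the
comment in famfs_file.c:

/*
 * User-space sketch (assumes the counters sit directly under
 * /sys/fs/famfs/ as the patch comment suggests).
 *
 *   $ gcc -o famfs_faults famfs_faults.c && sudo ./famfs_faults
 */
#include <stdio.h>
#include <stdlib.h>

static long read_counter(const char *path)
{
	long val = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	/* Enabling also clears the counters (see fault_count_enable_store()) */
	FILE *en = fopen("/sys/fs/famfs/fault_count_enable", "w");

	if (!en) {
		perror("fault_count_enable");
		return EXIT_FAILURE;
	}
	fprintf(en, "1\n");
	fclose(en);

	/* ... run an mmap workload against a famfs file here ... */

	printf("pte faults: %ld\n", read_counter("/sys/fs/famfs/pte_fault_ct"));
	printf("pmd faults: %ld\n", read_counter("/sys/fs/famfs/pmd_fault_ct"));
	printf("pud faults: %ld\n", read_counter("/sys/fs/famfs/pud_fault_ct"));
	return EXIT_SUCCESS;
}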