Message ID | 1535476780-5773-3-git-send-email-longman@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fs/dcache: Track # of negative dentries | expand |
On Tue, Aug 28, 2018 at 01:19:40PM -0400, Waiman Long wrote: > @@ -134,7 +135,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item) > spin_lock(&nlru->lock); > if (list_empty(item)) { > l = list_lru_from_kmem(nlru, item, &memcg); > - list_add_tail(item, &l->list); > + (add_tail ? list_add_tail : list_add)(item, &l->list); > /* Set shrinker bit if the first element was added */ > if (!l->nr_items++) > memcg_set_shrinker_bit(memcg, nid, That's not OK. Write it out properly, ie: if (add_tail) list_add_tail(item, &l->list); else list_add(item, &l->list);
On 08/28/2018 06:13 PM, Matthew Wilcox wrote: > On Tue, Aug 28, 2018 at 01:19:40PM -0400, Waiman Long wrote: >> @@ -134,7 +135,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item) >> spin_lock(&nlru->lock); >> if (list_empty(item)) { >> l = list_lru_from_kmem(nlru, item, &memcg); >> - list_add_tail(item, &l->list); >> + (add_tail ? list_add_tail : list_add)(item, &l->list); >> /* Set shrinker bit if the first element was added */ >> if (!l->nr_items++) >> memcg_set_shrinker_bit(memcg, nid, > That's not OK. Write it out properly, ie: > > if (add_tail) > list_add_tail(item, &l->list); > else > list_add(item, &l->list); > Yes, I can rewrite it. What is the problem with the abbreviated form? Cheers, Longman
Another pet peeve ;) On Tue, 28 Aug 2018 13:19:40 -0400 Waiman Long <longman@redhat.com> wrote: > /** > + * list_lru_add_head: add an element to the lru list's head > + * @list_lru: the lru pointer > + * @item: the item to be added. > + * > + * This is similar to list_lru_add(). The only difference is the location > + * where the new item will be added. The list_lru_add() function will add People often use the term "the foo() function". I don't know why - just say "foo()"! > + * the new item to the tail as it is the most recently used one. The > + * list_lru_add_head() will add the new item into the head so that it Ditto. "to the head" > + * will the first to go if a shrinker is running. So this function should "will be the" > + * only be used for less important item that can be the first to go if "items" > + * the system is under memory pressure. > + * > + * Return value: true if the list was updated, false otherwise > + */
On Tue, Aug 28, 2018 at 3:29 PM Waiman Long <longman@redhat.com> wrote: > > Yes, I can rewrite it. What is the problem with the abbreviated form? Either gcc rewrites it for you, or you end up _actually_ using a function pointer and calling through it. The latter would be absolutely horribly bad for something like "list_add()", which should expand to just a couple of instructions. And the former would be ok, except for the "you wrote code the garbage way, and then depended on the compiler fixing it up". Which we generally try to avoid in the kernel. (Don't get me wrong - we definitely depend on the compiler doing a good job at CSE and dead code elimination etc, but generally we try to avoid the whole "compiler has to rewrite code to be good" model). Linus
On Tue, 28 Aug 2018 16:10:24 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Aug 28, 2018 at 3:29 PM Waiman Long <longman@redhat.com> wrote: > > > > Yes, I can rewrite it. What is the problem with the abbreviated form? > > Either gcc rewrites it for you, or you end up _actually_ using a > function pointer and calling through it. > > The latter would be absolutely horribly bad for something like > "list_add()", which should expand to just a couple of instructions. > > And the former would be ok, except for the "you wrote code the garbage > way, and then depended on the compiler fixing it up". Which we > generally try to avoid in the kernel. > > (Don't get me wrong - we definitely depend on the compiler doing a > good job at CSE and dead code elimination etc, but generally we try to > avoid the whole "compiler has to rewrite code to be good" model). > And the "abbreviated form" will surely explode if one or both of those "functions" happens to be implemented (or later reimplemented) as a macro. It's best not to unnecessarily make such assumptions.
On Tue, Aug 28, 2018 at 01:19:40PM -0400, Waiman Long wrote: > For negative dentries that are accessed once and never used again, they > should be removed first before other dentries when shrinker is running. > This is done by putting negative dentries at the head of the LRU list > instead at the tail. > > A new DCACHE_NEW_NEGATIVE flag is now added to a negative dentry when it > is initially created. When such a dentry is added to the LRU, it will be > added to the head so that it will be the first to go when a shrinker is > running if it is never accessed again (DCACHE_REFERENCED bit not set). > The flag is cleared after the LRU list addition. This exposes internal LRU list functionality to callers. I carefully hid what end of the list was the most or least recent from subsystems interacting with LRUs precisely because "head" and "tail" are completely confusing when interacting with a LRU. LRUs are about tracking relative object age, not heads and tails. IOWs, "track this object as most recently used", not "add this object to the list tail". The opposite is "track this object is least recently used", not "add object at head of list". IOWs, the interface should be list_lru_add_oldest() or maybe list_lru_add_least_recent() to indicate that these objects are considered to be the oldest and therefore prime immediate reclaim candidates. Which points out a problem. That the most recent negative dentry will be the first to be reclaimed. That's not LRU behaviour, and prevents a working set of negative dentries from being retained when the shrinker rotates through the dentries in LRU order. i.e. this patch turns the LRU into a MRU list for negative dentries. And then there's shrinker behaviour. What happens if the shrinker isolate callback returns LRU_ROTATE on a negative dentry? It gets moved to the most recent end of the list, so it won't have an attempt to reclaim it again until it's tried to reclaim all the real dentries. IOWs, it goes back to behaving like LRUs are supposed to behaving. IOWs, reclaim behaviour of negative dentries will be highly unpredictable, it will not easily retain a working set, nor will the working set it does retain be predictable or easy to eject from memory when the workload changes. Is this the behavour what we want for negative dentries? Perhaps a second internal LRU list in the list_lru for "immediate reclaim" objects would be a better solution. i.e. similar to the active/inactive lists used for prioritising the working set iover single use pages in page reclaim. negative dentries go onto the immediate list, real dentries go on the existing list. Both are LRU, and the shrinker operates on the immediate list first. When we rotate referenced negative dentries on the immediate list, promote them to the active list with all the real dentries so they age out with the rest of the working set. That way single use negative dentries get removed in LRU order in preference to the working set of real dentries. Being able to make changes to the list implementation easily was one of the reasons I hid the implementation of the list_lru from the interface callers use.... [...] > @@ -430,8 +424,20 @@ static void d_lru_add(struct dentry *dentry) > D_FLAG_VERIFY(dentry, 0); > dentry->d_flags |= DCACHE_LRU_LIST; > this_cpu_inc(nr_dentry_unused); > + if (d_is_negative(dentry)) { > + __neg_dentry_inc(dentry); /me notes this patch now open codes this, like suggested in the previous patch. > + if (dentry->d_flags & DCACHE_NEW_NEGATIVE) { > + /* > + * Add the negative dentry to the head once, it > + * will be added to the tail next time. > + */ > + WARN_ON_ONCE(!list_lru_add_head( > + &dentry->d_sb->s_dentry_lru, &dentry->d_lru)); > + dentry->d_flags &= ~DCACHE_NEW_NEGATIVE; > + return; > + } > + } > WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru)); > - neg_dentry_inc(dentry); > } > > static void d_lru_del(struct dentry *dentry) > @@ -2620,6 +2626,9 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode) > __d_set_inode_and_type(dentry, inode, add_flags); > raw_write_seqcount_end(&dentry->d_seq); > fsnotify_update_flags(dentry); > + } else { > + /* It is a negative dentry, add it to LRU head initially. */ > + dentry->d_flags |= DCACHE_NEW_NEGATIVE; Comments about LRU behaviour should not be put anywhere but in the lru code. Otherwise it ends up stale and wrong, assuming it is correct to start with.... > } > __d_rehash(dentry); > if (dir) > diff --git a/include/linux/dcache.h b/include/linux/dcache.h > index df942e5..03a1918 100644 > --- a/include/linux/dcache.h > +++ b/include/linux/dcache.h > @@ -214,6 +214,7 @@ struct dentry_operations { > #define DCACHE_FALLTHRU 0x01000000 /* Fall through to lower layer */ > #define DCACHE_ENCRYPTED_WITH_KEY 0x02000000 /* dir is encrypted with a valid key */ > #define DCACHE_OP_REAL 0x04000000 > +#define DCACHE_NEW_NEGATIVE 0x08000000 /* New negative dentry */ > > #define DCACHE_PAR_LOOKUP 0x10000000 /* being looked up (with parent locked shared) */ > #define DCACHE_DENTRY_CURSOR 0x20000000 > diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h > index aa5efd9..bfac057 100644 > --- a/include/linux/list_lru.h > +++ b/include/linux/list_lru.h > @@ -90,6 +90,23 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware, > bool list_lru_add(struct list_lru *lru, struct list_head *item); > > /** > + * list_lru_add_head: add an element to the lru list's head > + * @list_lru: the lru pointer > + * @item: the item to be added. > + * > + * This is similar to list_lru_add(). The only difference is the location > + * where the new item will be added. The list_lru_add() function will add > + * the new item to the tail as it is the most recently used one. The > + * list_lru_add_head() will add the new item into the head so that it > + * will the first to go if a shrinker is running. So this function should > + * only be used for less important item that can be the first to go if > + * the system is under memory pressure. As mentioned above, this API needs reference object ages, not internal list ordering and shrinker implementation details. Cheers, Dave.
On 08/28/2018 07:10 PM, Linus Torvalds wrote: > On Tue, Aug 28, 2018 at 3:29 PM Waiman Long <longman@redhat.com> wrote: >> Yes, I can rewrite it. What is the problem with the abbreviated form? > Either gcc rewrites it for you, or you end up _actually_ using a > function pointer and calling through it. Yes, function pointer will be really bad. > > The latter would be absolutely horribly bad for something like > "list_add()", which should expand to just a couple of instructions. > > And the former would be ok, except for the "you wrote code the garbage > way, and then depended on the compiler fixing it up". Which we > generally try to avoid in the kernel. > > (Don't get me wrong - we definitely depend on the compiler doing a > good job at CSE and dead code elimination etc, but generally we try to > avoid the whole "compiler has to rewrite code to be good" model). > > Linus I see your point here. I will rewrite to use the regular if-then-else. Thanks, Longman
On 08/28/2018 07:22 PM, Andrew Morton wrote: > On Tue, 28 Aug 2018 16:10:24 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > >> On Tue, Aug 28, 2018 at 3:29 PM Waiman Long <longman@redhat.com> wrote: >>> Yes, I can rewrite it. What is the problem with the abbreviated form? >> Either gcc rewrites it for you, or you end up _actually_ using a >> function pointer and calling through it. >> >> The latter would be absolutely horribly bad for something like >> "list_add()", which should expand to just a couple of instructions. >> >> And the former would be ok, except for the "you wrote code the garbage >> way, and then depended on the compiler fixing it up". Which we >> generally try to avoid in the kernel. >> >> (Don't get me wrong - we definitely depend on the compiler doing a >> good job at CSE and dead code elimination etc, but generally we try to >> avoid the whole "compiler has to rewrite code to be good" model). >> > And the "abbreviated form" will surely explode if one or both of those > "functions" happens to be implemented (or later reimplemented) as a macro. > It's best not to unnecessarily make such assumptions. > Yes, that is true. Thanks, Longman
On Tue 28-08-18 13:19:40, Waiman Long wrote: > For negative dentries that are accessed once and never used again, they > should be removed first before other dentries when shrinker is running. > This is done by putting negative dentries at the head of the LRU list > instead at the tail. > > A new DCACHE_NEW_NEGATIVE flag is now added to a negative dentry when it > is initially created. When such a dentry is added to the LRU, it will be > added to the head so that it will be the first to go when a shrinker is > running if it is never accessed again (DCACHE_REFERENCED bit not set). > The flag is cleared after the LRU list addition. Placing object to the head of the LRU list can be really tricky as Dave pointed out. I am not familiar with the dentry cache reclaim so my comparison below might not apply. Let me try anyway. Negative dentries sound very similar to MADV_FREE pages from the reclaim POV. They are primary candidate for reclaim, yet you want to preserve aging to other easily reclaimable objects (including other MADV_FREE pages). What we do for those pages is to move them from the anonymous LRU list to the inactive file LRU list. Now you obviously do not have anon/file LRUs but something similar to active/inactive LRU lists might be a reasonably good match. Have easily reclaimable dentries on the inactive list including negative dentries. If negative entries are heavily used then they can promote to the active list because there is no reason to reclaim them soon. Just my 2c
On Tue, Aug 28, 2018 at 04:01:50PM -0700, Andrew Morton wrote: > > Another pet peeve ;) > > On Tue, 28 Aug 2018 13:19:40 -0400 Waiman Long <longman@redhat.com> wrote: > > > /** > > + * list_lru_add_head: add an element to the lru list's head > > + * @list_lru: the lru pointer > > + * @item: the item to be added. > > + * > > + * This is similar to list_lru_add(). The only difference is the location > > + * where the new item will be added. The list_lru_add() function will add > > People often use the term "the foo() function". I don't know why - > just say "foo()"! For whatever it is worth... I tend to use "The foo() function ..." instead of "foo() ..." in order to properly capitalize the first word of the sentence. So I might say "The call_rcu() function enqueues an RCU callback." rather than something like "call_rcu() enqueues an RCU callback." Or I might use some other trick to keep "call_rcu()" from being the first word of the sentence. But if the end of the previous sentence introduced call_rcu(), you usually want the next sentence's first use of "call_rcu()" to be very early in the sentence, because otherwise the flow will seem choppy. And no, I have no idea what I would do if I were writing in German, where nouns are capitalized, given that function names tend to be used as nouns. Probably I would get yelled at a lot for capitalizing my function names. ;-) Thanx, Paul
On 08/28/2018 09:02 PM, Dave Chinner wrote: > On Tue, Aug 28, 2018 at 01:19:40PM -0400, Waiman Long wrote: >> For negative dentries that are accessed once and never used again, they >> should be removed first before other dentries when shrinker is running. >> This is done by putting negative dentries at the head of the LRU list >> instead at the tail. >> >> A new DCACHE_NEW_NEGATIVE flag is now added to a negative dentry when it >> is initially created. When such a dentry is added to the LRU, it will be >> added to the head so that it will be the first to go when a shrinker is >> running if it is never accessed again (DCACHE_REFERENCED bit not set). >> The flag is cleared after the LRU list addition. > This exposes internal LRU list functionality to callers. I carefully > hid what end of the list was the most or least recent from > subsystems interacting with LRUs precisely because "head" and "tail" > are completely confusing when interacting with a LRU. > > LRUs are about tracking relative object age, not heads and tails. > > IOWs, "track this object as most recently used", not "add this > object to the list tail". The opposite is "track this object is > least recently used", not "add object at head of list". > > IOWs, the interface should be list_lru_add_oldest() or maybe > list_lru_add_least_recent() to indicate that these objects are > considered to be the oldest and therefore prime immediate reclaim > candidates. Good point. I will use age as function name instead. > Which points out a problem. That the most recent negative dentry > will be the first to be reclaimed. That's not LRU behaviour, and > prevents a working set of negative dentries from being retained > when the shrinker rotates through the dentries in LRU order. i.e. > this patch turns the LRU into a MRU list for negative dentries. > > And then there's shrinker behaviour. What happens if the shrinker > isolate callback returns LRU_ROTATE on a negative dentry? It gets > moved to the most recent end of the list, so it won't have an > attempt to reclaim it again until it's tried to reclaim all the real > dentries. IOWs, it goes back to behaving like LRUs are supposed to > behaving. > > IOWs, reclaim behaviour of negative dentries will be > highly unpredictable, it will not easily retain a working set, nor > will the working set it does retain be predictable or easy to eject > from memory when the workload changes. > > Is this the behavour what we want for negative dentries? I am aware that the behavior is not strictly LRU for negative dentries. This is a compromise for using one LRU list for 2 different classes of dentries. The basic idea is that negative dentries that are used only once will go first irrespective of their age. > Perhaps a second internal LRU list in the list_lru for "immediate > reclaim" objects would be a better solution. i.e. similar to the > active/inactive lists used for prioritising the working set iover > single use pages in page reclaim. negative dentries go onto the > immediate list, real dentries go on the existing list. Both are LRU, > and the shrinker operates on the immediate list first. When we > rotate referenced negative dentries on the immediate list, promote > them to the active list with all the real dentries so they age out > with the rest of the working set. That way single use negative > dentries get removed in LRU order in preference to the working set > of real dentries. > > Being able to make changes to the list implementation easily was one > of the reasons I hid the implementation of the list_lru from the > interface callers use.... > > [...] I have thought about using 2 lists for dentries. That will require much more extensive changes to the code and hence much more testing will be needed to verify their correctness. That is the main reason why I try to avoid doing that. As you have suggested, we could implement this 2-level LRU list in the list_lru API. But it is used by other subsystems as well. Extensive change like that will have similar issue in term of testing and verification effort. >> @@ -430,8 +424,20 @@ static void d_lru_add(struct dentry *dentry) >> D_FLAG_VERIFY(dentry, 0); >> dentry->d_flags |= DCACHE_LRU_LIST; >> this_cpu_inc(nr_dentry_unused); >> + if (d_is_negative(dentry)) { >> + __neg_dentry_inc(dentry); > /me notes this patch now open codes this, like suggested in the > previous patch. > >> + if (dentry->d_flags & DCACHE_NEW_NEGATIVE) { >> + /* >> + * Add the negative dentry to the head once, it >> + * will be added to the tail next time. >> + */ >> + WARN_ON_ONCE(!list_lru_add_head( >> + &dentry->d_sb->s_dentry_lru, &dentry->d_lru)); >> + dentry->d_flags &= ~DCACHE_NEW_NEGATIVE; >> + return; >> + } >> + } >> WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru)); >> - neg_dentry_inc(dentry); >> } >> >> static void d_lru_del(struct dentry *dentry) >> @@ -2620,6 +2626,9 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode) >> __d_set_inode_and_type(dentry, inode, add_flags); >> raw_write_seqcount_end(&dentry->d_seq); >> fsnotify_update_flags(dentry); >> + } else { >> + /* It is a negative dentry, add it to LRU head initially. */ >> + dentry->d_flags |= DCACHE_NEW_NEGATIVE; > Comments about LRU behaviour should not be put anywhere but in the > lru code. Otherwise it ends up stale and wrong, assuming it is > correct to start with.... > I will document all that in the LRU code. >> } >> __d_rehash(dentry); >> if (dir) >> diff --git a/include/linux/dcache.h b/include/linux/dcache.h >> index df942e5..03a1918 100644 >> --- a/include/linux/dcache.h >> +++ b/include/linux/dcache.h >> @@ -214,6 +214,7 @@ struct dentry_operations { >> #define DCACHE_FALLTHRU 0x01000000 /* Fall through to lower layer */ >> #define DCACHE_ENCRYPTED_WITH_KEY 0x02000000 /* dir is encrypted with a valid key */ >> #define DCACHE_OP_REAL 0x04000000 >> +#define DCACHE_NEW_NEGATIVE 0x08000000 /* New negative dentry */ >> >> #define DCACHE_PAR_LOOKUP 0x10000000 /* being looked up (with parent locked shared) */ >> #define DCACHE_DENTRY_CURSOR 0x20000000 >> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h >> index aa5efd9..bfac057 100644 >> --- a/include/linux/list_lru.h >> +++ b/include/linux/list_lru.h >> @@ -90,6 +90,23 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware, >> bool list_lru_add(struct list_lru *lru, struct list_head *item); >> >> /** >> + * list_lru_add_head: add an element to the lru list's head >> + * @list_lru: the lru pointer >> + * @item: the item to be added. >> + * >> + * This is similar to list_lru_add(). The only difference is the location >> + * where the new item will be added. The list_lru_add() function will add >> + * the new item to the tail as it is the most recently used one. The >> + * list_lru_add_head() will add the new item into the head so that it >> + * will the first to go if a shrinker is running. So this function should >> + * only be used for less important item that can be the first to go if >> + * the system is under memory pressure. > As mentioned above, this API needs reference object ages, not > internal list ordering and shrinker implementation details. Sure. Thanks for the suggestion. Cheers, Longman
On 08/29/2018 03:51 AM, Michal Hocko wrote: > On Tue 28-08-18 13:19:40, Waiman Long wrote: >> For negative dentries that are accessed once and never used again, they >> should be removed first before other dentries when shrinker is running. >> This is done by putting negative dentries at the head of the LRU list >> instead at the tail. >> >> A new DCACHE_NEW_NEGATIVE flag is now added to a negative dentry when it >> is initially created. When such a dentry is added to the LRU, it will be >> added to the head so that it will be the first to go when a shrinker is >> running if it is never accessed again (DCACHE_REFERENCED bit not set). >> The flag is cleared after the LRU list addition. > Placing object to the head of the LRU list can be really tricky as Dave > pointed out. I am not familiar with the dentry cache reclaim so my > comparison below might not apply. Let me try anyway. > > Negative dentries sound very similar to MADV_FREE pages from the reclaim > POV. They are primary candidate for reclaim, yet you want to preserve > aging to other easily reclaimable objects (including other MADV_FREE > pages). What we do for those pages is to move them from the anonymous > LRU list to the inactive file LRU list. Now you obviously do not have > anon/file LRUs but something similar to active/inactive LRU lists might > be a reasonably good match. Have easily reclaimable dentries on the > inactive list including negative dentries. If negative entries are > heavily used then they can promote to the active list because there is > no reason to reclaim them soon. > > Just my 2c As mentioned in my reply to Dave, I did considered using a 2 LRU list solution. However, that will add more complexity to the dcache LRU management code than my current approach and probably more potential for slowdown. One point that I forgot to mention is that the current dcache LRU list will only update the order of the dentries when the list is being shrinked. If a dentry was referenced again before (the DCACHE_REFERENCED bit set), it will rotated to the tail of the list via LRU rotation instead of being removed. This is not strictly LRU anyway. The way that my patch work will just make it move further away from true LRU behavior. Cheers, Longman
On 08/29/2018 01:54 PM, Paul E. McKenney wrote: > On Tue, Aug 28, 2018 at 04:01:50PM -0700, Andrew Morton wrote: >> Another pet peeve ;) >> >> On Tue, 28 Aug 2018 13:19:40 -0400 Waiman Long <longman@redhat.com> wrote: >> >>> /** >>> + * list_lru_add_head: add an element to the lru list's head >>> + * @list_lru: the lru pointer >>> + * @item: the item to be added. >>> + * >>> + * This is similar to list_lru_add(). The only difference is the location >>> + * where the new item will be added. The list_lru_add() function will add >> People often use the term "the foo() function". I don't know why - >> just say "foo()"! > For whatever it is worth... > > I tend to use "The foo() function ..." instead of "foo() ..." in order > to properly capitalize the first word of the sentence. So I might say > "The call_rcu() function enqueues an RCU callback." rather than something > like "call_rcu() enqueues an RCU callback." Or I might use some other > trick to keep "call_rcu()" from being the first word of the sentence. > But if the end of the previous sentence introduced call_rcu(), you > usually want the next sentence's first use of "call_rcu()" to be very > early in the sentence, because otherwise the flow will seem choppy. > > And no, I have no idea what I would do if I were writing in German, > where nouns are capitalized, given that function names tend to be used > as nouns. Probably I would get yelled at a lot for capitalizing my > function names. ;-) > > Thanx, Paul > Yes, doing proper capitalization of the first letter of a sentence is the main reason I used "The foo() function" in a sentence. Cheers, Longman
On Wed, Aug 29, 2018 at 10:54:05AM -0700, Paul E. McKenney wrote: > On Tue, Aug 28, 2018 at 04:01:50PM -0700, Andrew Morton wrote: > > People often use the term "the foo() function". I don't know why - > > just say "foo()"! > > For whatever it is worth... > > I tend to use "The foo() function ..." instead of "foo() ..." in order > to properly capitalize the first word of the sentence. So I might say > "The call_rcu() function enqueues an RCU callback." rather than something > like "call_rcu() enqueues an RCU callback." Or I might use some other > trick to keep "call_rcu()" from being the first word of the sentence. I tend to write 'Use call_rcu() to enqueue a callback." or "When call_rcu() returns, the callback will have been enqueued".
On Wed, Aug 29, 2018 at 03:34:05PM -0400, Waiman Long wrote: > On 08/28/2018 09:02 PM, Dave Chinner wrote: > > On Tue, Aug 28, 2018 at 01:19:40PM -0400, Waiman Long wrote: > > And then there's shrinker behaviour. What happens if the shrinker > > isolate callback returns LRU_ROTATE on a negative dentry? It gets > > moved to the most recent end of the list, so it won't have an > > attempt to reclaim it again until it's tried to reclaim all the real > > dentries. IOWs, it goes back to behaving like LRUs are supposed to > > behaving. > > > > IOWs, reclaim behaviour of negative dentries will be > > highly unpredictable, it will not easily retain a working set, nor > > will the working set it does retain be predictable or easy to eject > > from memory when the workload changes. > > > > Is this the behavour what we want for negative dentries? > > I am aware that the behavior is not strictly LRU for negative dentries. > This is a compromise for using one LRU list for 2 different classes of > dentries. Thus demonstrating just enough knowledge to be dangerous. We already 3 different classes of dentries on the LRU list: - in use dentries (because we use lazy removal to avoid lru list lock contention on cache hit lookups) - unused, referenced dentries - unused, unreferenced dentries. Each of these classes of dentries are treated differently by the shrinker, but the key point is that they are all aged the same way and so there's consistent maintenance of the working set under memory pressure. Putting negative dentries at the head of the list doesn't create a new "class" of object on the LRU, it just changes the ordering of the lru list. This will cause unpredictable behaviour because objects haven't had a chance to age gracefully before they are reclaimed. FYI, the inode cache has the same list_lru setup, object types and shrinker algorithm as the dentry cache, so this isn't a one-off. Indeed, the XFS buffer cache has a multi-reference heirarchy of 13 different types of {unused, referenced} buffers in it's list_lru to implement a quasi aging-NFU reclaim algorithm in it's shrinker. i.e. the list_lru infrastructure has never provided or enforced a pure LRU algorithm. It is common infrastructure intended to provide a scalable, flexible and memcg-aware FIFO-like object tracking system that interates tightly with memory reclaim to allow subsystems to implement cache reclaim algorithms that are optimal for that subsystem. IOWs, the list_lru doesn't define the reclaim algorithm the subsystem uses and there's no reason why we can't extend the infrastructure to support more complex algorithms without impacting existing subsystem reclaim algorithms at all. Only the subsystems that use the new infrastructure and algorithms would need careful examination. Of course, the overall system cache balancing behaviour under different and changing workloads would still need to be verified, but you have to do that for any cache reclaim algorithm change that is made.... > The basic idea is that negative dentries that are used only > once will go first irrespective of their age. Using MRU for negative dentries, as I've previously explained, is a flawed concept. It might be expedient to solve your specific problem, but it's not a solution to the general problem of negative dentry management. > > Perhaps a second internal LRU list in the list_lru for "immediate > > reclaim" objects would be a better solution. i.e. similar to the > > active/inactive lists used for prioritising the working set iover > > single use pages in page reclaim. negative dentries go onto the > > immediate list, real dentries go on the existing list. Both are LRU, > > and the shrinker operates on the immediate list first. When we > > rotate referenced negative dentries on the immediate list, promote > > them to the active list with all the real dentries so they age out > > with the rest of the working set. That way single use negative > > dentries get removed in LRU order in preference to the working set > > of real dentries. > > > > Being able to make changes to the list implementation easily was one > > of the reasons I hid the implementation of the list_lru from the > > interface callers use.... > > > > [...] > > I have thought about using 2 lists for dentries. That will require much > more extensive changes to the code and hence much more testing will be > needed to verify their correctness. That is the main reason why I try to > avoid doing that. i.e. expediency. However, you're changing the behaviour of core caching and memory reclaim algorithms. The amount and level of testing and verification you need to do is the same regardless of whether it's a small change or a large change. Sure, you've shown that *one* artificial micro-benchmark improves, but what about everything else? > As you have suggested, we could implement this 2-level LRU list in the > list_lru API. But it is used by other subsystems as well. Extensive > change like that will have similar issue in term of testing and > verification effort. I know what changing reclaim algorithm involves, how difficult it is to balance the competing caches quickly and to the desired ratios for acceptable performance, how difficult it is to measure and control the system reacts to transient and impulse memory pressure events, etc. I also know that "simple" and/or "obviously correct" subsystem changes can cause very unexepected system level effects, and that it's almost never what you think it is that caused the unexpected behaviour. IOWs, getting anything even slightly wrong in these algorithms will adversely affect system performance and balance significantly. Hence the bar is /always/ set high for core caching algorithm changes like this. Cheers, Dave.
On Wed 29-08-18 15:58:52, Waiman Long wrote: > On 08/29/2018 03:51 AM, Michal Hocko wrote: > > On Tue 28-08-18 13:19:40, Waiman Long wrote: > >> For negative dentries that are accessed once and never used again, they > >> should be removed first before other dentries when shrinker is running. > >> This is done by putting negative dentries at the head of the LRU list > >> instead at the tail. > >> > >> A new DCACHE_NEW_NEGATIVE flag is now added to a negative dentry when it > >> is initially created. When such a dentry is added to the LRU, it will be > >> added to the head so that it will be the first to go when a shrinker is > >> running if it is never accessed again (DCACHE_REFERENCED bit not set). > >> The flag is cleared after the LRU list addition. > > Placing object to the head of the LRU list can be really tricky as Dave > > pointed out. I am not familiar with the dentry cache reclaim so my > > comparison below might not apply. Let me try anyway. > > > > Negative dentries sound very similar to MADV_FREE pages from the reclaim > > POV. They are primary candidate for reclaim, yet you want to preserve > > aging to other easily reclaimable objects (including other MADV_FREE > > pages). What we do for those pages is to move them from the anonymous > > LRU list to the inactive file LRU list. Now you obviously do not have > > anon/file LRUs but something similar to active/inactive LRU lists might > > be a reasonably good match. Have easily reclaimable dentries on the > > inactive list including negative dentries. If negative entries are > > heavily used then they can promote to the active list because there is > > no reason to reclaim them soon. > > > > Just my 2c > > As mentioned in my reply to Dave, I did considered using a 2 LRU list > solution. However, that will add more complexity to the dcache LRU > management code than my current approach and probably more potential for > slowdown. I completely agree with Dave here. This is not easy but trying to sneak in something that works for an _artificial_ workload is simply a no go. So if it takes to come with a more complex solution to cover more general workloads then be it. Someone has to bite a bullet and explore that direction. It won't be a simple project but well, if negative dentries really matter then it is worth making the reclaim design robust and comprehensible rather than adhoc and unpredictable.
On 08/30/2018 03:20 AM, Michal Hocko wrote: > On Wed 29-08-18 15:58:52, Waiman Long wrote: >> On 08/29/2018 03:51 AM, Michal Hocko wrote: >>> On Tue 28-08-18 13:19:40, Waiman Long wrote: >>>> For negative dentries that are accessed once and never used again, they >>>> should be removed first before other dentries when shrinker is running. >>>> This is done by putting negative dentries at the head of the LRU list >>>> instead at the tail. >>>> >>>> A new DCACHE_NEW_NEGATIVE flag is now added to a negative dentry when it >>>> is initially created. When such a dentry is added to the LRU, it will be >>>> added to the head so that it will be the first to go when a shrinker is >>>> running if it is never accessed again (DCACHE_REFERENCED bit not set). >>>> The flag is cleared after the LRU list addition. >>> Placing object to the head of the LRU list can be really tricky as Dave >>> pointed out. I am not familiar with the dentry cache reclaim so my >>> comparison below might not apply. Let me try anyway. >>> >>> Negative dentries sound very similar to MADV_FREE pages from the reclaim >>> POV. They are primary candidate for reclaim, yet you want to preserve >>> aging to other easily reclaimable objects (including other MADV_FREE >>> pages). What we do for those pages is to move them from the anonymous >>> LRU list to the inactive file LRU list. Now you obviously do not have >>> anon/file LRUs but something similar to active/inactive LRU lists might >>> be a reasonably good match. Have easily reclaimable dentries on the >>> inactive list including negative dentries. If negative entries are >>> heavily used then they can promote to the active list because there is >>> no reason to reclaim them soon. >>> >>> Just my 2c >> As mentioned in my reply to Dave, I did considered using a 2 LRU list >> solution. However, that will add more complexity to the dcache LRU >> management code than my current approach and probably more potential for >> slowdown. > I completely agree with Dave here. This is not easy but trying to sneak > in something that works for an _artificial_ workload is simply a no go. > So if it takes to come with a more complex solution to cover more > general workloads then be it. Someone has to bite a bullet and explore > that direction. It won't be a simple project but well, if negative > dentries really matter then it is worth making the reclaim design robust > and comprehensible rather than adhoc and unpredictable. OK, I will need to spend more time to think about a better way of doing that. Cheers, Longman
On 08/29/2018 09:12 PM, Dave Chinner wrote: > On Wed, Aug 29, 2018 at 03:34:05PM -0400, Waiman Long wrote: >> On 08/28/2018 09:02 PM, Dave Chinner wrote: >>> On Tue, Aug 28, 2018 at 01:19:40PM -0400, Waiman Long wrote: >>> And then there's shrinker behaviour. What happens if the shrinker >>> isolate callback returns LRU_ROTATE on a negative dentry? It gets >>> moved to the most recent end of the list, so it won't have an >>> attempt to reclaim it again until it's tried to reclaim all the real >>> dentries. IOWs, it goes back to behaving like LRUs are supposed to >>> behaving. >>> >>> IOWs, reclaim behaviour of negative dentries will be >>> highly unpredictable, it will not easily retain a working set, nor >>> will the working set it does retain be predictable or easy to eject >>> from memory when the workload changes. >>> >>> Is this the behavour what we want for negative dentries? >> I am aware that the behavior is not strictly LRU for negative dentries. >> This is a compromise for using one LRU list for 2 different classes of >> dentries. > Thus demonstrating just enough knowledge to be dangerous. > > We already 3 different classes of dentries on the LRU list: > > - in use dentries (because we use lazy removal to avoid > lru list lock contention on cache hit lookups) > - unused, referenced dentries > - unused, unreferenced dentries. > > Each of these classes of dentries are treated differently by the > shrinker, but the key point is that they are all aged the same way > and so there's consistent maintenance of the working set under > memory pressure. Putting negative dentries at the head of the list > doesn't create a new "class" of object on the LRU, it just changes > the ordering of the lru list. This will cause unpredictable > behaviour because objects haven't had a chance to age gracefully > before they are reclaimed. > > FYI, the inode cache has the same list_lru setup, object types and > shrinker algorithm as the dentry cache, so this isn't a one-off. > Indeed, the XFS buffer cache has a multi-reference heirarchy of 13 > different types of {unused, referenced} buffers in it's list_lru to > implement a quasi aging-NFU reclaim algorithm in it's shrinker. > > i.e. the list_lru infrastructure has never provided or enforced a > pure LRU algorithm. It is common infrastructure intended to provide > a scalable, flexible and memcg-aware FIFO-like object tracking > system that interates tightly with memory reclaim to allow > subsystems to implement cache reclaim algorithms that are optimal > for that subsystem. > > IOWs, the list_lru doesn't define the reclaim algorithm the > subsystem uses and there's no reason why we can't extend the > infrastructure to support more complex algorithms without impacting > existing subsystem reclaim algorithms at all. Only the subsystems > that use the new infrastructure and algorithms would need careful > examination. Of course, the overall system cache balancing > behaviour under different and changing workloads would still need to > be verified, but you have to do that for any cache reclaim algorithm > change that is made.... > >> The basic idea is that negative dentries that are used only >> once will go first irrespective of their age. > Using MRU for negative dentries, as I've previously explained, is a > flawed concept. It might be expedient to solve your specific > problem, but it's not a solution to the general problem of negative > dentry management. > >>> Perhaps a second internal LRU list in the list_lru for "immediate >>> reclaim" objects would be a better solution. i.e. similar to the >>> active/inactive lists used for prioritising the working set iover >>> single use pages in page reclaim. negative dentries go onto the >>> immediate list, real dentries go on the existing list. Both are LRU, >>> and the shrinker operates on the immediate list first. When we >>> rotate referenced negative dentries on the immediate list, promote >>> them to the active list with all the real dentries so they age out >>> with the rest of the working set. That way single use negative >>> dentries get removed in LRU order in preference to the working set >>> of real dentries. >>> >>> Being able to make changes to the list implementation easily was one >>> of the reasons I hid the implementation of the list_lru from the >>> interface callers use.... >>> >>> [...] >> I have thought about using 2 lists for dentries. That will require much >> more extensive changes to the code and hence much more testing will be >> needed to verify their correctness. That is the main reason why I try to >> avoid doing that. > i.e. expediency. > > However, you're changing the behaviour of core caching and memory > reclaim algorithms. The amount and level of testing and verification > you need to do is the same regardless of whether it's a small change > or a large change. Sure, you've shown that *one* artificial > micro-benchmark improves, but what about everything else? > >> As you have suggested, we could implement this 2-level LRU list in the >> list_lru API. But it is used by other subsystems as well. Extensive >> change like that will have similar issue in term of testing and >> verification effort. > I know what changing reclaim algorithm involves, how difficult > it is to balance the competing caches quickly and to the desired > ratios for acceptable performance, how difficult it is to measure > and control the system reacts to transient and impulse memory > pressure events, etc. > > I also know that "simple" and/or "obviously correct" subsystem > changes can cause very unexepected system level effects, and that > it's almost never what you think it is that caused the unexpected > behaviour. IOWs, getting anything even slightly wrong in these > algorithms will adversely affect system performance and balance > significantly. Hence the bar is /always/ set high for core caching > algorithm changes like this. > > Cheers, > > Dave. Thanks for the comments. I will need more time to think about it. Cheers, Longman
diff --git a/fs/dcache.c b/fs/dcache.c index 69f5541..ab6a4cf 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -242,12 +242,6 @@ static inline void __neg_dentry_inc(struct dentry *dentry) this_cpu_inc(nr_dentry_neg); } -static inline void neg_dentry_inc(struct dentry *dentry) -{ - if (unlikely(d_is_negative(dentry))) - __neg_dentry_inc(dentry); -} - static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *ct, unsigned tcount) { /* @@ -353,7 +347,7 @@ static inline void __d_set_inode_and_type(struct dentry *dentry, dentry->d_inode = inode; flags = READ_ONCE(dentry->d_flags); - flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU); + flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU | DCACHE_NEW_NEGATIVE); flags |= type_flags; WRITE_ONCE(dentry->d_flags, flags); } @@ -430,8 +424,20 @@ static void d_lru_add(struct dentry *dentry) D_FLAG_VERIFY(dentry, 0); dentry->d_flags |= DCACHE_LRU_LIST; this_cpu_inc(nr_dentry_unused); + if (d_is_negative(dentry)) { + __neg_dentry_inc(dentry); + if (dentry->d_flags & DCACHE_NEW_NEGATIVE) { + /* + * Add the negative dentry to the head once, it + * will be added to the tail next time. + */ + WARN_ON_ONCE(!list_lru_add_head( + &dentry->d_sb->s_dentry_lru, &dentry->d_lru)); + dentry->d_flags &= ~DCACHE_NEW_NEGATIVE; + return; + } + } WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru)); - neg_dentry_inc(dentry); } static void d_lru_del(struct dentry *dentry) @@ -2620,6 +2626,9 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode) __d_set_inode_and_type(dentry, inode, add_flags); raw_write_seqcount_end(&dentry->d_seq); fsnotify_update_flags(dentry); + } else { + /* It is a negative dentry, add it to LRU head initially. */ + dentry->d_flags |= DCACHE_NEW_NEGATIVE; } __d_rehash(dentry); if (dir) diff --git a/include/linux/dcache.h b/include/linux/dcache.h index df942e5..03a1918 100644 --- a/include/linux/dcache.h +++ b/include/linux/dcache.h @@ -214,6 +214,7 @@ struct dentry_operations { #define DCACHE_FALLTHRU 0x01000000 /* Fall through to lower layer */ #define DCACHE_ENCRYPTED_WITH_KEY 0x02000000 /* dir is encrypted with a valid key */ #define DCACHE_OP_REAL 0x04000000 +#define DCACHE_NEW_NEGATIVE 0x08000000 /* New negative dentry */ #define DCACHE_PAR_LOOKUP 0x10000000 /* being looked up (with parent locked shared) */ #define DCACHE_DENTRY_CURSOR 0x20000000 diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h index aa5efd9..bfac057 100644 --- a/include/linux/list_lru.h +++ b/include/linux/list_lru.h @@ -90,6 +90,23 @@ int __list_lru_init(struct list_lru *lru, bool memcg_aware, bool list_lru_add(struct list_lru *lru, struct list_head *item); /** + * list_lru_add_head: add an element to the lru list's head + * @list_lru: the lru pointer + * @item: the item to be added. + * + * This is similar to list_lru_add(). The only difference is the location + * where the new item will be added. The list_lru_add() function will add + * the new item to the tail as it is the most recently used one. The + * list_lru_add_head() will add the new item into the head so that it + * will the first to go if a shrinker is running. So this function should + * only be used for less important item that can be the first to go if + * the system is under memory pressure. + * + * Return value: true if the list was updated, false otherwise + */ +bool list_lru_add_head(struct list_lru *lru, struct list_head *item); + +/** * list_lru_del: delete an element to the lru list * @list_lru: the lru pointer * @item: the item to be deleted. diff --git a/mm/list_lru.c b/mm/list_lru.c index 5b30625..133f41c 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -124,7 +124,8 @@ static inline bool list_lru_memcg_aware(struct list_lru *lru) } #endif /* CONFIG_MEMCG_KMEM */ -bool list_lru_add(struct list_lru *lru, struct list_head *item) +static inline bool __list_lru_add(struct list_lru *lru, struct list_head *item, + const bool add_tail) { int nid = page_to_nid(virt_to_page(item)); struct list_lru_node *nlru = &lru->node[nid]; @@ -134,7 +135,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item) spin_lock(&nlru->lock); if (list_empty(item)) { l = list_lru_from_kmem(nlru, item, &memcg); - list_add_tail(item, &l->list); + (add_tail ? list_add_tail : list_add)(item, &l->list); /* Set shrinker bit if the first element was added */ if (!l->nr_items++) memcg_set_shrinker_bit(memcg, nid, @@ -146,8 +147,19 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item) spin_unlock(&nlru->lock); return false; } + +bool list_lru_add(struct list_lru *lru, struct list_head *item) +{ + return __list_lru_add(lru, item, true); +} EXPORT_SYMBOL_GPL(list_lru_add); +bool list_lru_add_head(struct list_lru *lru, struct list_head *item) +{ + return __list_lru_add(lru, item, false); +} +EXPORT_SYMBOL_GPL(list_lru_add_head); + bool list_lru_del(struct list_lru *lru, struct list_head *item) { int nid = page_to_nid(virt_to_page(item));
For negative dentries that are accessed once and never used again, they should be removed first before other dentries when shrinker is running. This is done by putting negative dentries at the head of the LRU list instead at the tail. A new DCACHE_NEW_NEGATIVE flag is now added to a negative dentry when it is initially created. When such a dentry is added to the LRU, it will be added to the head so that it will be the first to go when a shrinker is running if it is never accessed again (DCACHE_REFERENCED bit not set). The flag is cleared after the LRU list addition. Suggested-by: Larry Woodman <lwoodman@redhat.com> Signed-off-by: Waiman Long <longman@redhat.com> --- fs/dcache.c | 25 +++++++++++++++++-------- include/linux/dcache.h | 1 + include/linux/list_lru.h | 17 +++++++++++++++++ mm/list_lru.c | 16 ++++++++++++++-- 4 files changed, 49 insertions(+), 10 deletions(-)