Message ID | 173629103327.22054.7411711418787098876@noble.neil.brown.name (mailing list archive) |
---|---|
State | RFC |
Headers | show |
Series | [RFC] nfsd: change filecache garbage collection list management. | expand |
On Wed, 2025-01-08 at 10:03 +1100, NeilBrown wrote: > The nfsd filecache currently uses list_lru for tracking files recently > used in NFSv3 requests which need to be "garbage collected" when they > have becoming idle - unused for 2-4 seconds. > > I do not believe list_lru is a good tool for this. It does no allow the > timeout which filecache requires so we have to add a timeout mechanism > which holds the list_lru for while the whole list is scanned looking for > entries that haven't been recently accessed. When the list is largish > (even a few hundred) this can block new requests which need the lock to > remove a file to access it. > Wait, what? The whole point of an LRU list is that the least-recently- used entries are at the head of the list. We should only need to walk down to the first entry that can't be reaped and then stop there. Why is it walking the whole list? > This patch removes the list_lru and instead uses 2 simple linked lists. > When a file is accessed it is removed from whichever list it is one, > then added to the tail of the first list. Every 2 seconds the second > list is moved to the "freeme" list and the first list is moved to the > second list. This avoids any need to walk a list to find old entries. > I do like the above idea though. That does seem more efficient than having to scan lists. > These lists are per-netns rather than global as the freeme list is > per-netns as the actual freeing is done in nfsd threads which are > per-netns. > > Also the shrinker is moved to be per-netns so that it can access just > the lists of that netns. > > We don't need a work_item for the gc pass as it never sleeps. We can > use a simple timer instead. This requires taking the spinlock with > "_bh". > > Previously a file would be unhashed before being moved to the freeme > list. We don't do that any more. The freeme list is much like the > other two lists (recent and older) in that they all hold a reference to > the file and the file is still hashed. > > When the nfsd thread processes > the freeme list it now use the new nfsd_file_release_list() which uses > nfsd_file_cond_queue() to unhash and drop the refcount. > > We no longer have a precise cound of the size of the lru (recent + > older) as we don't know how big "older" is when it is moved to "freeme". > How the shrinker can with an approximation. So when we keep a count of > number in the lru and hwen "older" is moved to "freeme" we divide that > count by 2. When we remove anything from the lru we decrement that > counter but ensure it never goes negative. Naturally when we add to the > lru we increase the counter. > > For the filecache stats file, which assumes a global lru, we keep a > separate counter which includes all files in all netns in recent or > older of freeme. > > We discard the nf_gc linkage in an nfsd_file and only use nf_lru. > We discard NFSD_FILE_REFERENCED. > nfsd_file_close_inode_sync() included a copy of > nfsd_file_dispose_list(). This has been change to call that function > instead. > > Possibly this patch could be broken into a few smaller patches. > Yeah, that would help. > Tracepoints should be revieed. > > Signed-off-by: NeilBrown <neilb@suse.de> > --- > fs/nfsd/filecache.c | 304 +++++++++++++++++++++++--------------------- > fs/nfsd/filecache.h | 4 +- > fs/nfsd/trace.h | 1 - > 3 files changed, 157 insertions(+), 152 deletions(-) > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > index a1cdba42c4fa..319c60234f09 100644 > --- a/fs/nfsd/filecache.c > +++ b/fs/nfsd/filecache.c > @@ -34,7 +34,6 @@ > #include <linux/file.h> > #include <linux/pagemap.h> > #include <linux/sched.h> > -#include <linux/list_lru.h> > #include <linux/fsnotify_backend.h> > #include <linux/fsnotify.h> > #include <linux/seq_file.h> > @@ -63,17 +62,22 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions); > > struct nfsd_fcache_disposal { > spinlock_t lock; > - struct list_head freeme; > + struct timer_list timer; > + struct list_head recent; /* have been used in last 0-2 seconds */ > + struct list_head older; /* haven't been used in last 0-2 seconds */ > + struct list_head freeme; /* ready to be discarded */ > + unsigned long num_gc; /* Approximate size of recent plus older */ > + struct shrinker *file_shrinker; > + struct nfsd_net *nn; > }; > > static struct kmem_cache *nfsd_file_slab; > static struct kmem_cache *nfsd_file_mark_slab; > -static struct list_lru nfsd_file_lru; > static unsigned long nfsd_file_flags; > static struct fsnotify_group *nfsd_file_fsnotify_group; > -static struct delayed_work nfsd_filecache_laundrette; > static struct rhltable nfsd_file_rhltable > ____cacheline_aligned_in_smp; > +static atomic_long_t nfsd_lru_total = ATOMIC_LONG_INIT(0); > > static bool > nfsd_match_cred(const struct cred *c1, const struct cred *c2) > @@ -109,11 +113,14 @@ static const struct rhashtable_params nfsd_file_rhash_params = { > }; > > static void > -nfsd_file_schedule_laundrette(void) > +nfsd_file_schedule_laundrette(struct net *net) > { > - if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags)) > - queue_delayed_work(system_unbound_wq, &nfsd_filecache_laundrette, > - NFSD_LAUNDRETTE_DELAY); > + struct nfsd_net *nn = net_generic(net, nfsd_net_id); > + struct nfsd_fcache_disposal *l = nn->fcache_disposal; > + > + if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags) && > + !timer_pending(&l->timer)) > + mod_timer(&l->timer, jiffies + NFSD_LAUNDRETTE_DELAY); > } > > static void > @@ -218,7 +225,6 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need, > > this_cpu_inc(nfsd_file_allocations); > INIT_LIST_HEAD(&nf->nf_lru); > - INIT_LIST_HEAD(&nf->nf_gc); > nf->nf_birthtime = ktime_get(); > nf->nf_file = NULL; > nf->nf_cred = get_current_cred(); > @@ -318,23 +324,40 @@ nfsd_file_check_writeback(struct nfsd_file *nf) > mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK); > } > > - > static bool nfsd_file_lru_add(struct nfsd_file *nf) > { > - set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags); > - if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) { > + struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id); > + struct nfsd_fcache_disposal *l = nn->fcache_disposal; > + > + spin_lock_bh(&l->lock); > + if (list_empty(&nf->nf_lru)) { > + list_add_tail(&nf->nf_lru, &l->recent); > + l->num_gc += 1; > + atomic_long_inc(&nfsd_lru_total); > + spin_unlock_bh(&l->lock); > trace_nfsd_file_lru_add(nf); > return true; > } > + spin_unlock_bh(&l->lock); > return false; > } > > static bool nfsd_file_lru_remove(struct nfsd_file *nf) > { > - if (list_lru_del_obj(&nfsd_file_lru, &nf->nf_lru)) { > + struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id); > + struct nfsd_fcache_disposal *l = nn->fcache_disposal; > + > + spin_lock_bh(&l->lock); > + if (!list_empty(&nf->nf_lru)) { > + list_del_init(&nf->nf_lru); > + atomic_long_dec(&nfsd_lru_total); > + if (l->num_gc > 0) > + l->num_gc -= 1; > + spin_unlock_bh(&l->lock); > trace_nfsd_file_lru_del(nf); > return true; > } > + spin_unlock_bh(&l->lock); > return false; > } > > @@ -373,7 +396,7 @@ nfsd_file_put(struct nfsd_file *nf) > if (nfsd_file_lru_add(nf)) { > /* If it's still hashed, we're done */ > if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) { > - nfsd_file_schedule_laundrette(); > + nfsd_file_schedule_laundrette(nf->nf_net); > return; > } > > @@ -424,12 +447,26 @@ nfsd_file_dispose_list(struct list_head *dispose) > struct nfsd_file *nf; > > while (!list_empty(dispose)) { > - nf = list_first_entry(dispose, struct nfsd_file, nf_gc); > - list_del_init(&nf->nf_gc); > + nf = list_first_entry(dispose, struct nfsd_file, nf_lru); > + list_del_init(&nf->nf_lru); > nfsd_file_free(nf); > } > } > > +static void > +nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose); > + > +static void > +nfsd_file_release_list(struct list_head *dispose) > +{ > + LIST_HEAD(dispose2); > + struct nfsd_file *nf, *nf2; > + > + list_for_each_entry_safe(nf, nf2, dispose, nf_lru) > + nfsd_file_cond_queue(nf, &dispose2); > + nfsd_file_dispose_list(&dispose2); > +} > + > /** > * nfsd_file_dispose_list_delayed - move list of dead files to net's freeme list > * @dispose: list of nfsd_files to be disposed > @@ -442,13 +479,13 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose) > { > while(!list_empty(dispose)) { > struct nfsd_file *nf = list_first_entry(dispose, > - struct nfsd_file, nf_gc); > + struct nfsd_file, nf_lru); > struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id); > struct nfsd_fcache_disposal *l = nn->fcache_disposal; > > - spin_lock(&l->lock); > - list_move_tail(&nf->nf_gc, &l->freeme); > - spin_unlock(&l->lock); > + spin_lock_bh(&l->lock); > + list_move_tail(&nf->nf_lru, &l->freeme); > + spin_unlock_bh(&l->lock); > svc_wake_up(nn->nfsd_serv); > } > } > @@ -470,115 +507,97 @@ void nfsd_file_net_dispose(struct nfsd_net *nn) > LIST_HEAD(dispose); > int i; > > - spin_lock(&l->lock); > - for (i = 0; i < 8 && !list_empty(&l->freeme); i++) > - list_move(l->freeme.next, &dispose); > - spin_unlock(&l->lock); > + spin_lock_bh(&l->lock); > + for (i = 0; i < 8 && !list_empty(&l->freeme); i++) { > + struct nfsd_file *nf = list_first_entry( > + &l->freeme, struct nfsd_file, nf_lru); > + > + /* > + * Don't throw out files that are still > + * undergoing I/O or that have uncleared errors > + * pending. > + */ > + if (nfsd_file_check_writeback(nf)) { > + list_move(&nf->nf_lru, &l->recent); > + l->num_gc += 1; > + } else { > + list_move(&nf->nf_lru, &dispose); > + this_cpu_inc(nfsd_file_evictions); > + } > + } > + spin_unlock_bh(&l->lock); > if (!list_empty(&l->freeme)) > /* Wake up another thread to share the work > * *before* doing any actual disposing. > */ > svc_wake_up(nn->nfsd_serv); > - nfsd_file_dispose_list(&dispose); > - } > -} > - > -/** > - * nfsd_file_lru_cb - Examine an entry on the LRU list > - * @item: LRU entry to examine > - * @lru: controlling LRU > - * @arg: dispose list > - * > - * Return values: > - * %LRU_REMOVED: @item was removed from the LRU > - * %LRU_ROTATE: @item is to be moved to the LRU tail > - * %LRU_SKIP: @item cannot be evicted > - */ > -static enum lru_status > -nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru, > - void *arg) > -{ > - struct list_head *head = arg; > - struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru); > - > - /* We should only be dealing with GC entries here */ > - WARN_ON_ONCE(!test_bit(NFSD_FILE_GC, &nf->nf_flags)); > - > - /* > - * Don't throw out files that are still undergoing I/O or > - * that have uncleared errors pending. > - */ > - if (nfsd_file_check_writeback(nf)) { > - trace_nfsd_file_gc_writeback(nf); > - return LRU_SKIP; > - } > - > - /* If it was recently added to the list, skip it */ > - if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) { > - trace_nfsd_file_gc_referenced(nf); > - return LRU_ROTATE; > - } > - > - /* > - * Put the reference held on behalf of the LRU. If it wasn't the last > - * one, then just remove it from the LRU and ignore it. > - */ > - if (!refcount_dec_and_test(&nf->nf_ref)) { > - trace_nfsd_file_gc_in_use(nf); > - list_lru_isolate(lru, &nf->nf_lru); > - return LRU_REMOVED; > + nfsd_file_release_list(&dispose); > } > - > - /* Refcount went to zero. Unhash it and queue it to the dispose list */ > - nfsd_file_unhash(nf); > - list_lru_isolate(lru, &nf->nf_lru); > - list_add(&nf->nf_gc, head); > - this_cpu_inc(nfsd_file_evictions); > - trace_nfsd_file_gc_disposed(nf); > - return LRU_REMOVED; > -} > - > -static void > -nfsd_file_gc(void) > -{ > - LIST_HEAD(dispose); > - unsigned long ret; > - > - ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb, > - &dispose, list_lru_count(&nfsd_file_lru)); > - trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru)); > - nfsd_file_dispose_list_delayed(&dispose); > } > > static void > -nfsd_file_gc_worker(struct work_struct *work) > +nfsd_file_gc_worker(struct timer_list *t) > { > - nfsd_file_gc(); > - if (list_lru_count(&nfsd_file_lru)) > - nfsd_file_schedule_laundrette(); > + struct nfsd_fcache_disposal *l = container_of( > + t, struct nfsd_fcache_disposal, timer); > + bool wakeup = false; > + > + spin_lock_bh(&l->lock); > + list_splice_init(&l->older, &l->freeme); > + list_splice_init(&l->recent, &l->older); > + /* We don't know how many were moved to 'freeme' and don't want > + * to waste time counting - guess a half. > + */ > + l->num_gc /= 2; > + if (!list_empty(&l->freeme)) > + wakeup = true; > + if (!list_empty(&l->older) || !list_empty(&l->recent)) > + mod_timer(t, jiffies + NFSD_LAUNDRETTE_DELAY); > + spin_unlock_bh(&l->lock); > + if (wakeup) > + svc_wake_up(l->nn->nfsd_serv); > } > > static unsigned long > nfsd_file_lru_count(struct shrinker *s, struct shrink_control *sc) > { > - return list_lru_count(&nfsd_file_lru); > + struct nfsd_fcache_disposal *l = s->private_data; > + return l->num_gc; > } > > static unsigned long > nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc) > { > - LIST_HEAD(dispose); > - unsigned long ret; > + struct nfsd_fcache_disposal *l = s->private_data; > + struct nfsd_file *nf; > + int scanned = 0; > + > + spin_lock_bh(&l->lock); > + while (scanned < sc->nr_to_scan && > + (nf = list_first_entry_or_null(&l->older, > + struct nfsd_file, nf_lru)) != NULL) { > + list_del_init(&nf->nf_lru); > + list_add_tail(&nf->nf_lru, &l->freeme); > + if (l->num_gc > 0) > + l->num_gc -= 1; > + scanned += 1; > + } > + if (scanned > 0) > + svc_wake_up(l->nn->nfsd_serv); > + > + while (scanned < sc->nr_to_scan && > + (nf = list_first_entry_or_null(&l->recent, > + struct nfsd_file, nf_lru)) != NULL) { > + list_del_init(&nf->nf_lru); > + list_add_tail(&nf->nf_lru, &l->older); > + scanned += 1; > + } > + spin_unlock_bh(&l->lock); > > - ret = list_lru_shrink_walk(&nfsd_file_lru, sc, > - nfsd_file_lru_cb, &dispose); > - trace_nfsd_file_shrinker_removed(ret, list_lru_count(&nfsd_file_lru)); > - nfsd_file_dispose_list_delayed(&dispose); > - return ret; > + trace_nfsd_file_shrinker_removed(scanned, l->num_gc); > + return scanned; > } > > -static struct shrinker *nfsd_file_shrinker; > - > /** > * nfsd_file_cond_queue - conditionally unhash and queue a nfsd_file > * @nf: nfsd_file to attempt to queue > @@ -589,7 +608,6 @@ static struct shrinker *nfsd_file_shrinker; > */ > static void > nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose) > - __must_hold(RCU) > { > int decrement = 1; > > @@ -607,7 +625,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose) > > /* If refcount goes to 0, then put on the dispose list */ > if (refcount_sub_and_test(decrement, &nf->nf_ref)) { > - list_add(&nf->nf_gc, dispose); > + list_add(&nf->nf_lru, dispose); > trace_nfsd_file_closing(nf); > } > } > @@ -676,17 +694,12 @@ nfsd_file_close_inode(struct inode *inode) > void > nfsd_file_close_inode_sync(struct inode *inode) > { > - struct nfsd_file *nf; > LIST_HEAD(dispose); > > trace_nfsd_file_close(inode); > > nfsd_file_queue_for_close(inode, &dispose); > - while (!list_empty(&dispose)) { > - nf = list_first_entry(&dispose, struct nfsd_file, nf_gc); > - list_del_init(&nf->nf_gc); > - nfsd_file_free(nf); > - } > + nfsd_file_dispose_list(&dispose); > } > > static int > @@ -763,29 +776,10 @@ nfsd_file_cache_init(void) > goto out_err; > } > > - ret = list_lru_init(&nfsd_file_lru); > - if (ret) { > - pr_err("nfsd: failed to init nfsd_file_lru: %d\n", ret); > - goto out_err; > - } > - > - nfsd_file_shrinker = shrinker_alloc(0, "nfsd-filecache"); > - if (!nfsd_file_shrinker) { > - ret = -ENOMEM; > - pr_err("nfsd: failed to allocate nfsd_file_shrinker\n"); > - goto out_lru; > - } > - > - nfsd_file_shrinker->count_objects = nfsd_file_lru_count; > - nfsd_file_shrinker->scan_objects = nfsd_file_lru_scan; > - nfsd_file_shrinker->seeks = 1; > - > - shrinker_register(nfsd_file_shrinker); > - > ret = lease_register_notifier(&nfsd_file_lease_notifier); > if (ret) { > pr_err("nfsd: unable to register lease notifier: %d\n", ret); > - goto out_shrinker; > + goto out_err; > } > > nfsd_file_fsnotify_group = fsnotify_alloc_group(&nfsd_file_fsnotify_ops, > @@ -798,17 +792,12 @@ nfsd_file_cache_init(void) > goto out_notifier; > } > > - INIT_DELAYED_WORK(&nfsd_filecache_laundrette, nfsd_file_gc_worker); > out: > if (ret) > clear_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags); > return ret; > out_notifier: > lease_unregister_notifier(&nfsd_file_lease_notifier); > -out_shrinker: > - shrinker_free(nfsd_file_shrinker); > -out_lru: > - list_lru_destroy(&nfsd_file_lru); > out_err: > kmem_cache_destroy(nfsd_file_slab); > nfsd_file_slab = NULL; > @@ -860,14 +849,38 @@ nfsd_alloc_fcache_disposal(void) > if (!l) > return NULL; > spin_lock_init(&l->lock); > + timer_setup(&l->timer, nfsd_file_gc_worker, 0); > + INIT_LIST_HEAD(&l->recent); > + INIT_LIST_HEAD(&l->older); > INIT_LIST_HEAD(&l->freeme); > + l->num_gc = 0; > + l->file_shrinker = shrinker_alloc(0, "nfsd-filecache"); > + if (!l->file_shrinker) { > + pr_err("nfsd: failed to allocate nfsd_file_shrinker\n"); > + kfree(l); > + return NULL; > + } > + l->file_shrinker->count_objects = nfsd_file_lru_count; > + l->file_shrinker->scan_objects = nfsd_file_lru_scan; > + l->file_shrinker->seeks = 1; > + l->file_shrinker->private_data = l; > + > + shrinker_register(l->file_shrinker); > + > return l; > } > > static void > nfsd_free_fcache_disposal(struct nfsd_fcache_disposal *l) > { > - nfsd_file_dispose_list(&l->freeme); > + del_timer_sync(&l->timer); > + shrinker_free(l->file_shrinker); > + nfsd_file_release_list(&l->recent); > + WARN_ON_ONCE(!list_empty(&l->recent)); > + nfsd_file_release_list(&l->older); > + WARN_ON_ONCE(!list_empty(&l->older)); > + nfsd_file_release_list(&l->freeme); > + WARN_ON_ONCE(!list_empty(&l->freeme)); > kfree(l); > } > > @@ -886,6 +899,8 @@ nfsd_file_cache_start_net(struct net *net) > struct nfsd_net *nn = net_generic(net, nfsd_net_id); > > nn->fcache_disposal = nfsd_alloc_fcache_disposal(); > + if (nn->fcache_disposal) > + nn->fcache_disposal->nn = nn; > return nn->fcache_disposal ? 0 : -ENOMEM; > } > > @@ -919,14 +934,7 @@ nfsd_file_cache_shutdown(void) > return; > > lease_unregister_notifier(&nfsd_file_lease_notifier); > - shrinker_free(nfsd_file_shrinker); > - /* > - * make sure all callers of nfsd_file_lru_cb are done before > - * calling nfsd_file_cache_purge > - */ > - cancel_delayed_work_sync(&nfsd_filecache_laundrette); > __nfsd_file_cache_purge(NULL); > - list_lru_destroy(&nfsd_file_lru); > rcu_barrier(); > fsnotify_put_group(nfsd_file_fsnotify_group); > nfsd_file_fsnotify_group = NULL; > @@ -1297,7 +1305,7 @@ int nfsd_file_cache_stats_show(struct seq_file *m, void *v) > struct bucket_table *tbl; > struct rhashtable *ht; > > - lru = list_lru_count(&nfsd_file_lru); > + lru = atomic_long_read(&nfsd_lru_total); > > rcu_read_lock(); > ht = &nfsd_file_rhltable.ht; > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h > index d5db6b34ba30..d88ae287c01f 100644 > --- a/fs/nfsd/filecache.h > +++ b/fs/nfsd/filecache.h > @@ -36,15 +36,13 @@ struct nfsd_file { > struct net *nf_net; > #define NFSD_FILE_HASHED (0) > #define NFSD_FILE_PENDING (1) > -#define NFSD_FILE_REFERENCED (2) > -#define NFSD_FILE_GC (3) > +#define NFSD_FILE_GC (2) > unsigned long nf_flags; > refcount_t nf_ref; > unsigned char nf_may; > > struct nfsd_file_mark *nf_mark; > struct list_head nf_lru; > - struct list_head nf_gc; > struct rcu_head nf_rcu; > ktime_t nf_birthtime; > }; > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h > index ad2c0c432d08..a266f1f21adc 100644 > --- a/fs/nfsd/trace.h > +++ b/fs/nfsd/trace.h > @@ -1038,7 +1038,6 @@ DEFINE_CLID_EVENT(confirmed_r); > __print_flags(val, "|", \ > { 1 << NFSD_FILE_HASHED, "HASHED" }, \ > { 1 << NFSD_FILE_PENDING, "PENDING" }, \ > - { 1 << NFSD_FILE_REFERENCED, "REFERENCED" }, \ > { 1 << NFSD_FILE_GC, "GC" }) > > DECLARE_EVENT_CLASS(nfsd_file_class, > > base-commit: 5e5a0681df5068efbd000dcfcbc34ae064fda1cc
On Thu, 09 Jan 2025, Jeff Layton wrote: > On Wed, 2025-01-08 at 10:03 +1100, NeilBrown wrote: > > The nfsd filecache currently uses list_lru for tracking files recently > > used in NFSv3 requests which need to be "garbage collected" when they > > have becoming idle - unused for 2-4 seconds. > > > > I do not believe list_lru is a good tool for this. It does no allow the > > timeout which filecache requires so we have to add a timeout mechanism > > which holds the list_lru for while the whole list is scanned looking for > > entries that haven't been recently accessed. When the list is largish > > (even a few hundred) this can block new requests which need the lock to > > remove a file to access it. > > > > Wait, what? The whole point of an LRU list is that the least-recently- > used entries are at the head of the list. We should only need to walk > down to the first entry that can't be reaped and then stop there. Why > is it walking the whole list? A "list_lru" is not just one list, it is one list per numa node. The "normal" use is that each numa node removes a few things from the head in the shrinker. Apart from nfsd, users only walk the whole list when shutting down the list. We need to visit every node in the list so we can clear the REFERENCED bit on every entry we decide to leave in the list. > > > > We discard the nf_gc linkage in an nfsd_file and only use nf_lru. > > We discard NFSD_FILE_REFERENCED. > > nfsd_file_close_inode_sync() included a copy of > > nfsd_file_dispose_list(). This has been change to call that function > > instead. > > > > Possibly this patch could be broken into a few smaller patches. > > > > Yeah, that would help. > I'm currently on leave until 20th Jan. I'll do that when I get back - if I don't find time before. Thanks, NeilBrown
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c index a1cdba42c4fa..319c60234f09 100644 --- a/fs/nfsd/filecache.c +++ b/fs/nfsd/filecache.c @@ -34,7 +34,6 @@ #include <linux/file.h> #include <linux/pagemap.h> #include <linux/sched.h> -#include <linux/list_lru.h> #include <linux/fsnotify_backend.h> #include <linux/fsnotify.h> #include <linux/seq_file.h> @@ -63,17 +62,22 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions); struct nfsd_fcache_disposal { spinlock_t lock; - struct list_head freeme; + struct timer_list timer; + struct list_head recent; /* have been used in last 0-2 seconds */ + struct list_head older; /* haven't been used in last 0-2 seconds */ + struct list_head freeme; /* ready to be discarded */ + unsigned long num_gc; /* Approximate size of recent plus older */ + struct shrinker *file_shrinker; + struct nfsd_net *nn; }; static struct kmem_cache *nfsd_file_slab; static struct kmem_cache *nfsd_file_mark_slab; -static struct list_lru nfsd_file_lru; static unsigned long nfsd_file_flags; static struct fsnotify_group *nfsd_file_fsnotify_group; -static struct delayed_work nfsd_filecache_laundrette; static struct rhltable nfsd_file_rhltable ____cacheline_aligned_in_smp; +static atomic_long_t nfsd_lru_total = ATOMIC_LONG_INIT(0); static bool nfsd_match_cred(const struct cred *c1, const struct cred *c2) @@ -109,11 +113,14 @@ static const struct rhashtable_params nfsd_file_rhash_params = { }; static void -nfsd_file_schedule_laundrette(void) +nfsd_file_schedule_laundrette(struct net *net) { - if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags)) - queue_delayed_work(system_unbound_wq, &nfsd_filecache_laundrette, - NFSD_LAUNDRETTE_DELAY); + struct nfsd_net *nn = net_generic(net, nfsd_net_id); + struct nfsd_fcache_disposal *l = nn->fcache_disposal; + + if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags) && + !timer_pending(&l->timer)) + mod_timer(&l->timer, jiffies + NFSD_LAUNDRETTE_DELAY); } static void @@ -218,7 +225,6 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need, this_cpu_inc(nfsd_file_allocations); INIT_LIST_HEAD(&nf->nf_lru); - INIT_LIST_HEAD(&nf->nf_gc); nf->nf_birthtime = ktime_get(); nf->nf_file = NULL; nf->nf_cred = get_current_cred(); @@ -318,23 +324,40 @@ nfsd_file_check_writeback(struct nfsd_file *nf) mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK); } - static bool nfsd_file_lru_add(struct nfsd_file *nf) { - set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags); - if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) { + struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id); + struct nfsd_fcache_disposal *l = nn->fcache_disposal; + + spin_lock_bh(&l->lock); + if (list_empty(&nf->nf_lru)) { + list_add_tail(&nf->nf_lru, &l->recent); + l->num_gc += 1; + atomic_long_inc(&nfsd_lru_total); + spin_unlock_bh(&l->lock); trace_nfsd_file_lru_add(nf); return true; } + spin_unlock_bh(&l->lock); return false; } static bool nfsd_file_lru_remove(struct nfsd_file *nf) { - if (list_lru_del_obj(&nfsd_file_lru, &nf->nf_lru)) { + struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id); + struct nfsd_fcache_disposal *l = nn->fcache_disposal; + + spin_lock_bh(&l->lock); + if (!list_empty(&nf->nf_lru)) { + list_del_init(&nf->nf_lru); + atomic_long_dec(&nfsd_lru_total); + if (l->num_gc > 0) + l->num_gc -= 1; + spin_unlock_bh(&l->lock); trace_nfsd_file_lru_del(nf); return true; } + spin_unlock_bh(&l->lock); return false; } @@ -373,7 +396,7 @@ nfsd_file_put(struct nfsd_file *nf) if (nfsd_file_lru_add(nf)) { /* If it's still hashed, we're done */ if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) { - nfsd_file_schedule_laundrette(); + nfsd_file_schedule_laundrette(nf->nf_net); return; } @@ -424,12 +447,26 @@ nfsd_file_dispose_list(struct list_head *dispose) struct nfsd_file *nf; while (!list_empty(dispose)) { - nf = list_first_entry(dispose, struct nfsd_file, nf_gc); - list_del_init(&nf->nf_gc); + nf = list_first_entry(dispose, struct nfsd_file, nf_lru); + list_del_init(&nf->nf_lru); nfsd_file_free(nf); } } +static void +nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose); + +static void +nfsd_file_release_list(struct list_head *dispose) +{ + LIST_HEAD(dispose2); + struct nfsd_file *nf, *nf2; + + list_for_each_entry_safe(nf, nf2, dispose, nf_lru) + nfsd_file_cond_queue(nf, &dispose2); + nfsd_file_dispose_list(&dispose2); +} + /** * nfsd_file_dispose_list_delayed - move list of dead files to net's freeme list * @dispose: list of nfsd_files to be disposed @@ -442,13 +479,13 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose) { while(!list_empty(dispose)) { struct nfsd_file *nf = list_first_entry(dispose, - struct nfsd_file, nf_gc); + struct nfsd_file, nf_lru); struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id); struct nfsd_fcache_disposal *l = nn->fcache_disposal; - spin_lock(&l->lock); - list_move_tail(&nf->nf_gc, &l->freeme); - spin_unlock(&l->lock); + spin_lock_bh(&l->lock); + list_move_tail(&nf->nf_lru, &l->freeme); + spin_unlock_bh(&l->lock); svc_wake_up(nn->nfsd_serv); } } @@ -470,115 +507,97 @@ void nfsd_file_net_dispose(struct nfsd_net *nn) LIST_HEAD(dispose); int i; - spin_lock(&l->lock); - for (i = 0; i < 8 && !list_empty(&l->freeme); i++) - list_move(l->freeme.next, &dispose); - spin_unlock(&l->lock); + spin_lock_bh(&l->lock); + for (i = 0; i < 8 && !list_empty(&l->freeme); i++) { + struct nfsd_file *nf = list_first_entry( + &l->freeme, struct nfsd_file, nf_lru); + + /* + * Don't throw out files that are still + * undergoing I/O or that have uncleared errors + * pending. + */ + if (nfsd_file_check_writeback(nf)) { + list_move(&nf->nf_lru, &l->recent); + l->num_gc += 1; + } else { + list_move(&nf->nf_lru, &dispose); + this_cpu_inc(nfsd_file_evictions); + } + } + spin_unlock_bh(&l->lock); if (!list_empty(&l->freeme)) /* Wake up another thread to share the work * *before* doing any actual disposing. */ svc_wake_up(nn->nfsd_serv); - nfsd_file_dispose_list(&dispose); - } -} - -/** - * nfsd_file_lru_cb - Examine an entry on the LRU list - * @item: LRU entry to examine - * @lru: controlling LRU - * @arg: dispose list - * - * Return values: - * %LRU_REMOVED: @item was removed from the LRU - * %LRU_ROTATE: @item is to be moved to the LRU tail - * %LRU_SKIP: @item cannot be evicted - */ -static enum lru_status -nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru, - void *arg) -{ - struct list_head *head = arg; - struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru); - - /* We should only be dealing with GC entries here */ - WARN_ON_ONCE(!test_bit(NFSD_FILE_GC, &nf->nf_flags)); - - /* - * Don't throw out files that are still undergoing I/O or - * that have uncleared errors pending. - */ - if (nfsd_file_check_writeback(nf)) { - trace_nfsd_file_gc_writeback(nf); - return LRU_SKIP; - } - - /* If it was recently added to the list, skip it */ - if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) { - trace_nfsd_file_gc_referenced(nf); - return LRU_ROTATE; - } - - /* - * Put the reference held on behalf of the LRU. If it wasn't the last - * one, then just remove it from the LRU and ignore it. - */ - if (!refcount_dec_and_test(&nf->nf_ref)) { - trace_nfsd_file_gc_in_use(nf); - list_lru_isolate(lru, &nf->nf_lru); - return LRU_REMOVED; + nfsd_file_release_list(&dispose); } - - /* Refcount went to zero. Unhash it and queue it to the dispose list */ - nfsd_file_unhash(nf); - list_lru_isolate(lru, &nf->nf_lru); - list_add(&nf->nf_gc, head); - this_cpu_inc(nfsd_file_evictions); - trace_nfsd_file_gc_disposed(nf); - return LRU_REMOVED; -} - -static void -nfsd_file_gc(void) -{ - LIST_HEAD(dispose); - unsigned long ret; - - ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb, - &dispose, list_lru_count(&nfsd_file_lru)); - trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru)); - nfsd_file_dispose_list_delayed(&dispose); } static void -nfsd_file_gc_worker(struct work_struct *work) +nfsd_file_gc_worker(struct timer_list *t) { - nfsd_file_gc(); - if (list_lru_count(&nfsd_file_lru)) - nfsd_file_schedule_laundrette(); + struct nfsd_fcache_disposal *l = container_of( + t, struct nfsd_fcache_disposal, timer); + bool wakeup = false; + + spin_lock_bh(&l->lock); + list_splice_init(&l->older, &l->freeme); + list_splice_init(&l->recent, &l->older); + /* We don't know how many were moved to 'freeme' and don't want + * to waste time counting - guess a half. + */ + l->num_gc /= 2; + if (!list_empty(&l->freeme)) + wakeup = true; + if (!list_empty(&l->older) || !list_empty(&l->recent)) + mod_timer(t, jiffies + NFSD_LAUNDRETTE_DELAY); + spin_unlock_bh(&l->lock); + if (wakeup) + svc_wake_up(l->nn->nfsd_serv); } static unsigned long nfsd_file_lru_count(struct shrinker *s, struct shrink_control *sc) { - return list_lru_count(&nfsd_file_lru); + struct nfsd_fcache_disposal *l = s->private_data; + return l->num_gc; } static unsigned long nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc) { - LIST_HEAD(dispose); - unsigned long ret; + struct nfsd_fcache_disposal *l = s->private_data; + struct nfsd_file *nf; + int scanned = 0; + + spin_lock_bh(&l->lock); + while (scanned < sc->nr_to_scan && + (nf = list_first_entry_or_null(&l->older, + struct nfsd_file, nf_lru)) != NULL) { + list_del_init(&nf->nf_lru); + list_add_tail(&nf->nf_lru, &l->freeme); + if (l->num_gc > 0) + l->num_gc -= 1; + scanned += 1; + } + if (scanned > 0) + svc_wake_up(l->nn->nfsd_serv); + + while (scanned < sc->nr_to_scan && + (nf = list_first_entry_or_null(&l->recent, + struct nfsd_file, nf_lru)) != NULL) { + list_del_init(&nf->nf_lru); + list_add_tail(&nf->nf_lru, &l->older); + scanned += 1; + } + spin_unlock_bh(&l->lock); - ret = list_lru_shrink_walk(&nfsd_file_lru, sc, - nfsd_file_lru_cb, &dispose); - trace_nfsd_file_shrinker_removed(ret, list_lru_count(&nfsd_file_lru)); - nfsd_file_dispose_list_delayed(&dispose); - return ret; + trace_nfsd_file_shrinker_removed(scanned, l->num_gc); + return scanned; } -static struct shrinker *nfsd_file_shrinker; - /** * nfsd_file_cond_queue - conditionally unhash and queue a nfsd_file * @nf: nfsd_file to attempt to queue @@ -589,7 +608,6 @@ static struct shrinker *nfsd_file_shrinker; */ static void nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose) - __must_hold(RCU) { int decrement = 1; @@ -607,7 +625,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose) /* If refcount goes to 0, then put on the dispose list */ if (refcount_sub_and_test(decrement, &nf->nf_ref)) { - list_add(&nf->nf_gc, dispose); + list_add(&nf->nf_lru, dispose); trace_nfsd_file_closing(nf); } } @@ -676,17 +694,12 @@ nfsd_file_close_inode(struct inode *inode) void nfsd_file_close_inode_sync(struct inode *inode) { - struct nfsd_file *nf; LIST_HEAD(dispose); trace_nfsd_file_close(inode); nfsd_file_queue_for_close(inode, &dispose); - while (!list_empty(&dispose)) { - nf = list_first_entry(&dispose, struct nfsd_file, nf_gc); - list_del_init(&nf->nf_gc); - nfsd_file_free(nf); - } + nfsd_file_dispose_list(&dispose); } static int @@ -763,29 +776,10 @@ nfsd_file_cache_init(void) goto out_err; } - ret = list_lru_init(&nfsd_file_lru); - if (ret) { - pr_err("nfsd: failed to init nfsd_file_lru: %d\n", ret); - goto out_err; - } - - nfsd_file_shrinker = shrinker_alloc(0, "nfsd-filecache"); - if (!nfsd_file_shrinker) { - ret = -ENOMEM; - pr_err("nfsd: failed to allocate nfsd_file_shrinker\n"); - goto out_lru; - } - - nfsd_file_shrinker->count_objects = nfsd_file_lru_count; - nfsd_file_shrinker->scan_objects = nfsd_file_lru_scan; - nfsd_file_shrinker->seeks = 1; - - shrinker_register(nfsd_file_shrinker); - ret = lease_register_notifier(&nfsd_file_lease_notifier); if (ret) { pr_err("nfsd: unable to register lease notifier: %d\n", ret); - goto out_shrinker; + goto out_err; } nfsd_file_fsnotify_group = fsnotify_alloc_group(&nfsd_file_fsnotify_ops, @@ -798,17 +792,12 @@ nfsd_file_cache_init(void) goto out_notifier; } - INIT_DELAYED_WORK(&nfsd_filecache_laundrette, nfsd_file_gc_worker); out: if (ret) clear_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags); return ret; out_notifier: lease_unregister_notifier(&nfsd_file_lease_notifier); -out_shrinker: - shrinker_free(nfsd_file_shrinker); -out_lru: - list_lru_destroy(&nfsd_file_lru); out_err: kmem_cache_destroy(nfsd_file_slab); nfsd_file_slab = NULL; @@ -860,14 +849,38 @@ nfsd_alloc_fcache_disposal(void) if (!l) return NULL; spin_lock_init(&l->lock); + timer_setup(&l->timer, nfsd_file_gc_worker, 0); + INIT_LIST_HEAD(&l->recent); + INIT_LIST_HEAD(&l->older); INIT_LIST_HEAD(&l->freeme); + l->num_gc = 0; + l->file_shrinker = shrinker_alloc(0, "nfsd-filecache"); + if (!l->file_shrinker) { + pr_err("nfsd: failed to allocate nfsd_file_shrinker\n"); + kfree(l); + return NULL; + } + l->file_shrinker->count_objects = nfsd_file_lru_count; + l->file_shrinker->scan_objects = nfsd_file_lru_scan; + l->file_shrinker->seeks = 1; + l->file_shrinker->private_data = l; + + shrinker_register(l->file_shrinker); + return l; } static void nfsd_free_fcache_disposal(struct nfsd_fcache_disposal *l) { - nfsd_file_dispose_list(&l->freeme); + del_timer_sync(&l->timer); + shrinker_free(l->file_shrinker); + nfsd_file_release_list(&l->recent); + WARN_ON_ONCE(!list_empty(&l->recent)); + nfsd_file_release_list(&l->older); + WARN_ON_ONCE(!list_empty(&l->older)); + nfsd_file_release_list(&l->freeme); + WARN_ON_ONCE(!list_empty(&l->freeme)); kfree(l); } @@ -886,6 +899,8 @@ nfsd_file_cache_start_net(struct net *net) struct nfsd_net *nn = net_generic(net, nfsd_net_id); nn->fcache_disposal = nfsd_alloc_fcache_disposal(); + if (nn->fcache_disposal) + nn->fcache_disposal->nn = nn; return nn->fcache_disposal ? 0 : -ENOMEM; } @@ -919,14 +934,7 @@ nfsd_file_cache_shutdown(void) return; lease_unregister_notifier(&nfsd_file_lease_notifier); - shrinker_free(nfsd_file_shrinker); - /* - * make sure all callers of nfsd_file_lru_cb are done before - * calling nfsd_file_cache_purge - */ - cancel_delayed_work_sync(&nfsd_filecache_laundrette); __nfsd_file_cache_purge(NULL); - list_lru_destroy(&nfsd_file_lru); rcu_barrier(); fsnotify_put_group(nfsd_file_fsnotify_group); nfsd_file_fsnotify_group = NULL; @@ -1297,7 +1305,7 @@ int nfsd_file_cache_stats_show(struct seq_file *m, void *v) struct bucket_table *tbl; struct rhashtable *ht; - lru = list_lru_count(&nfsd_file_lru); + lru = atomic_long_read(&nfsd_lru_total); rcu_read_lock(); ht = &nfsd_file_rhltable.ht; diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h index d5db6b34ba30..d88ae287c01f 100644 --- a/fs/nfsd/filecache.h +++ b/fs/nfsd/filecache.h @@ -36,15 +36,13 @@ struct nfsd_file { struct net *nf_net; #define NFSD_FILE_HASHED (0) #define NFSD_FILE_PENDING (1) -#define NFSD_FILE_REFERENCED (2) -#define NFSD_FILE_GC (3) +#define NFSD_FILE_GC (2) unsigned long nf_flags; refcount_t nf_ref; unsigned char nf_may; struct nfsd_file_mark *nf_mark; struct list_head nf_lru; - struct list_head nf_gc; struct rcu_head nf_rcu; ktime_t nf_birthtime; }; diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h index ad2c0c432d08..a266f1f21adc 100644 --- a/fs/nfsd/trace.h +++ b/fs/nfsd/trace.h @@ -1038,7 +1038,6 @@ DEFINE_CLID_EVENT(confirmed_r); __print_flags(val, "|", \ { 1 << NFSD_FILE_HASHED, "HASHED" }, \ { 1 << NFSD_FILE_PENDING, "PENDING" }, \ - { 1 << NFSD_FILE_REFERENCED, "REFERENCED" }, \ { 1 << NFSD_FILE_GC, "GC" }) DECLARE_EVENT_CLASS(nfsd_file_class,
The nfsd filecache currently uses list_lru for tracking files recently used in NFSv3 requests which need to be "garbage collected" when they have becoming idle - unused for 2-4 seconds. I do not believe list_lru is a good tool for this. It does no allow the timeout which filecache requires so we have to add a timeout mechanism which holds the list_lru for while the whole list is scanned looking for entries that haven't been recently accessed. When the list is largish (even a few hundred) this can block new requests which need the lock to remove a file to access it. This patch removes the list_lru and instead uses 2 simple linked lists. When a file is accessed it is removed from whichever list it is one, then added to the tail of the first list. Every 2 seconds the second list is moved to the "freeme" list and the first list is moved to the second list. This avoids any need to walk a list to find old entries. These lists are per-netns rather than global as the freeme list is per-netns as the actual freeing is done in nfsd threads which are per-netns. Also the shrinker is moved to be per-netns so that it can access just the lists of that netns. We don't need a work_item for the gc pass as it never sleeps. We can use a simple timer instead. This requires taking the spinlock with "_bh". Previously a file would be unhashed before being moved to the freeme list. We don't do that any more. The freeme list is much like the other two lists (recent and older) in that they all hold a reference to the file and the file is still hashed. When the nfsd thread processes the freeme list it now use the new nfsd_file_release_list() which uses nfsd_file_cond_queue() to unhash and drop the refcount. We no longer have a precise cound of the size of the lru (recent + older) as we don't know how big "older" is when it is moved to "freeme". How the shrinker can with an approximation. So when we keep a count of number in the lru and hwen "older" is moved to "freeme" we divide that count by 2. When we remove anything from the lru we decrement that counter but ensure it never goes negative. Naturally when we add to the lru we increase the counter. For the filecache stats file, which assumes a global lru, we keep a separate counter which includes all files in all netns in recent or older of freeme. We discard the nf_gc linkage in an nfsd_file and only use nf_lru. We discard NFSD_FILE_REFERENCED. nfsd_file_close_inode_sync() included a copy of nfsd_file_dispose_list(). This has been change to call that function instead. Possibly this patch could be broken into a few smaller patches. Tracepoints should be revieed. Signed-off-by: NeilBrown <neilb@suse.de> --- fs/nfsd/filecache.c | 304 +++++++++++++++++++++++--------------------- fs/nfsd/filecache.h | 4 +- fs/nfsd/trace.h | 1 - 3 files changed, 157 insertions(+), 152 deletions(-) base-commit: 5e5a0681df5068efbd000dcfcbc34ae064fda1cc