Message ID | 20180502222635.1862-1-mszeredi@redhat.com (mailing list archive) |
---|---|
State | New, archived |
On Thu, May 03, 2018 at 12:26:35AM +0200, Miklos Szeredi wrote:
> When multiple shrinkers are operating on a directory containing many
> dentries, it takes much longer than if only one shrinker is operating
> on the directory.
>
> Call the shrinker instances A and B, which shrink DIR containing NUM
> dentries.
>
> Assume A wins the race for locking DIR's d_lock, then it goes on to
> move all unlinked dentries to its dispose list. When it's done, B will
> scan the directory once again, but will find that all dentries are
> already being shrunk, so it will have an empty dispose list. Both A
> and B will have found NUM dentries (data.found == NUM).
>
> Now comes the interesting part: A will proceed to shrink the dispose
> list by killing individual dentries and decrementing the refcount of
> the parent (which is DIR). NB: decrementing DIR's refcount will block
> if DIR's d_lock is held. B will shrink a zero-size list and then
> immediately restart scanning the directory, where it will lock DIR's
> d_lock, scan the remaining dentries and find no dentry to dispose.
>
> So that results in B doing the directory scan over and over again,
> holding d_lock of DIR, while A is waiting for a chance to decrement
> the refcount of DIR and making very slow progress because of this. B
> is wasting time and holding up the progress of A at the same time.
>
> The proposed fix is to check for this situation in B (found some
> dentries, but all are being shrunk already) and sleep for some time
> before retrying the scan. The sleep is proportional to the number of
> found dentries.

The thing is, the majority of massive shrink_dcache_parent() runs can
be killed. Let's do that first and see if anything else is really
needed.

As it is, rmdir() and rename() are ridiculously bad - they should only
call shrink_dcache_parent() after a successful ->rmdir() or
->rename(). Sure, there are other places where we do large
shrink_dcache_parent() runs, but those won't trigger in parallel on
the same tree.

IOW, let's hold off on adding complexity until we fix the sources of
those calls.
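For context, the reordering suggested here might look roughly like the following in vfs_rmdir() (a sketch written from memory of the 4.17-era fs/namei.c; an illustrative fragment, not necessarily the change as eventually committed):

--- 8< ---
	/*
	 * Sketch of the suggested reordering in vfs_rmdir() (from memory
	 * of 4.17-era fs/namei.c; illustrative, not the exact commit):
	 * prune the subtree only after ->rmdir() has succeeded, so an
	 * rmdir() that fails (e.g. with ENOTEMPTY) no longer kicks off a
	 * massive shrink_dcache_parent() run that can race with other
	 * shrinkers on the same tree.
	 */
	error = dir->i_op->rmdir(dir, dentry);
	if (error)
		goto out;

	shrink_dcache_parent(dentry);	/* previously called before ->rmdir() */
	dentry->d_inode->i_flags |= S_DEAD;
	dont_mount(dentry);
	detach_mounts(dentry);
--- 8< ---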
On Thu, May 3, 2018 at 12:45 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, May 03, 2018 at 12:26:35AM +0200, Miklos Szeredi wrote:
>> When multiple shrinkers are operating on a directory containing many
>> dentries, it takes much longer than if only one shrinker is operating
>> on the directory.
>>
>> Call the shrinker instances A and B, which shrink DIR containing NUM
>> dentries.
>>
>> Assume A wins the race for locking DIR's d_lock, then it goes on to
>> move all unlinked dentries to its dispose list. When it's done, B will
>> scan the directory once again, but will find that all dentries are
>> already being shrunk, so it will have an empty dispose list. Both A
>> and B will have found NUM dentries (data.found == NUM).
>>
>> Now comes the interesting part: A will proceed to shrink the dispose
>> list by killing individual dentries and decrementing the refcount of
>> the parent (which is DIR). NB: decrementing DIR's refcount will block
>> if DIR's d_lock is held. B will shrink a zero-size list and then
>> immediately restart scanning the directory, where it will lock DIR's
>> d_lock, scan the remaining dentries and find no dentry to dispose.
>>
>> So that results in B doing the directory scan over and over again,
>> holding d_lock of DIR, while A is waiting for a chance to decrement
>> the refcount of DIR and making very slow progress because of this. B
>> is wasting time and holding up the progress of A at the same time.
>>
>> The proposed fix is to check for this situation in B (found some
>> dentries, but all are being shrunk already) and sleep for some time
>> before retrying the scan. The sleep is proportional to the number of
>> found dentries.
>
> The thing is, the majority of massive shrink_dcache_parent() runs can
> be killed. Let's do that first and see if anything else is really
> needed.
>
> As it is, rmdir() and rename() are ridiculously bad - they should only
> call shrink_dcache_parent() after a successful ->rmdir() or
> ->rename(). Sure, there are other places where we do large
> shrink_dcache_parent() runs, but those won't trigger in parallel on
> the same tree.

I think we can hit this also with the LRU pruners (prune_dcache_sb(),
shrink_dcache_sb()) running in parallel with shrink_dcache_parent().
Although shrink_dcache_sb() looks better in this regard, since it will
only hold up to 1024 dentries in the dispose list.

I'm open to a better solution, but keep in mind that it will also need
backporting.

Thanks,
Miklos
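For reference, the 1024-dentry batching mentioned here follows roughly this pattern (a simplified sketch from memory of 4.17-era fs/dcache.c, not verbatim kernel source):

--- 8< ---
/*
 * Simplified sketch of the batching pattern referred to above: at most
 * 1024 dentries are isolated from the superblock LRU onto a private
 * dispose list and killed per iteration, so no more than 1024 dentries
 * are "being shrunk" by this caller at any one time.
 */
void shrink_dcache_sb(struct super_block *sb)
{
	long freed;

	do {
		LIST_HEAD(dispose);

		freed = list_lru_walk(&sb->s_dentry_lru,
				      dentry_lru_isolate_shrink,
				      &dispose, 1024);

		this_cpu_sub(nr_dentry_unused, freed);
		shrink_dentry_list(&dispose);
		cond_resched();
	} while (list_lru_count(&sb->s_dentry_lru) > 0);
}
--- 8< ---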
On Thu, May 3, 2018 at 9:44 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Thu, May 3, 2018 at 12:45 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Thu, May 03, 2018 at 12:26:35AM +0200, Miklos Szeredi wrote:
>>> When multiple shrinkers are operating on a directory containing many
>>> dentries, it takes much longer than if only one shrinker is operating
>>> on the directory.
>>>
>>> Call the shrinker instances A and B, which shrink DIR containing NUM
>>> dentries.
>>>
>>> Assume A wins the race for locking DIR's d_lock, then it goes on to
>>> move all unlinked dentries to its dispose list. When it's done, B will
>>> scan the directory once again, but will find that all dentries are
>>> already being shrunk, so it will have an empty dispose list. Both A
>>> and B will have found NUM dentries (data.found == NUM).
>>>
>>> Now comes the interesting part: A will proceed to shrink the dispose
>>> list by killing individual dentries and decrementing the refcount of
>>> the parent (which is DIR). NB: decrementing DIR's refcount will block
>>> if DIR's d_lock is held. B will shrink a zero-size list and then
>>> immediately restart scanning the directory, where it will lock DIR's
>>> d_lock, scan the remaining dentries and find no dentry to dispose.
>>>
>>> So that results in B doing the directory scan over and over again,
>>> holding d_lock of DIR, while A is waiting for a chance to decrement
>>> the refcount of DIR and making very slow progress because of this. B
>>> is wasting time and holding up the progress of A at the same time.
>>>
>>> The proposed fix is to check for this situation in B (found some
>>> dentries, but all are being shrunk already) and sleep for some time
>>> before retrying the scan. The sleep is proportional to the number of
>>> found dentries.
>>
>> The thing is, the majority of massive shrink_dcache_parent() runs can
>> be killed. Let's do that first and see if anything else is really
>> needed.
>>
>> As it is, rmdir() and rename() are ridiculously bad - they should only
>> call shrink_dcache_parent() after a successful ->rmdir() or
>> ->rename(). Sure, there are other places where we do large
>> shrink_dcache_parent() runs, but those won't trigger in parallel on
>> the same tree.
>
> I think we can hit this also with the LRU pruners (prune_dcache_sb(),
> shrink_dcache_sb()) running in parallel with shrink_dcache_parent().
> Although shrink_dcache_sb() looks better in this regard, since it will
> only hold up to 1024 dentries in the dispose list.

Looking more closely, prune_dcache_sb() will also batch with a max of
1024 objects, which mitigates the problem but doesn't make it go away.
Killing 1024 dentries still takes on the order of 100us without
contention on d_lock. If shrink_dcache_parent() is busy-looping on
those dentries, then contention will make this much worse.

Thanks,
Miklos
diff --git a/fs/dcache.c b/fs/dcache.c
index 60df712262c2..ff250f3843d7 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -30,6 +30,7 @@
 #include <linux/bit_spinlock.h>
 #include <linux/rculist_bl.h>
 #include <linux/list_lru.h>
+#include <linux/delay.h>
 #include "internal.h"
 #include "mount.h"
 
@@ -1479,9 +1480,15 @@ void shrink_dcache_parent(struct dentry *parent)
 			continue;
 		}
 
-		cond_resched();
 		if (!data.found)
 			break;
+
+		/*
+		 * Wait some for other shrinkers to process these found
+		 * dentries. This formula gives about 100ns on average per
+		 * dentry for large number of dentries.
+		 */
+		usleep_range(data.found / 15 + 1, data.found / 7 + 2);
 	}
 }
 EXPORT_SYMBOL(shrink_dcache_parent);
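The "about 100ns on average per dentry" claim in the patch comment can be sanity-checked in userspace: the midpoint of the [found/15 + 1, found/7 + 2] microsecond range works out to roughly 0.105 us, i.e. ~105 ns, per found dentry for large counts. A small standalone check (illustrative only; compile with cc -std=c99):

--- 8< ---
/* Standalone check of the usleep_range() bounds used in the patch:
 * for large `found`, the midpoint of [found/15 + 1, found/7 + 2]
 * microseconds is about 100ns per found dentry. */
#include <stdio.h>

int main(void)
{
	for (unsigned long found = 1000; found <= 1000000; found *= 10) {
		unsigned long lo = found / 15 + 1;	/* usecs */
		unsigned long hi = found / 7 + 2;	/* usecs */
		double ns_per_dentry = (lo + hi) / 2.0 * 1000.0 / found;

		printf("found=%7lu  sleep=[%6lu..%6lu]us  ~%.0f ns/dentry\n",
		       found, lo, hi, ns_per_dentry);
	}
	return 0;
}
--- 8< ---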
When multiple shrinkers are operating on a directory containing many
dentries, it takes much longer than if only one shrinker is operating
on the directory.

Call the shrinker instances A and B, which shrink DIR containing NUM
dentries.

Assume A wins the race for locking DIR's d_lock, then it goes on to
move all unlinked dentries to its dispose list. When it's done, B will
scan the directory once again, but will find that all dentries are
already being shrunk, so it will have an empty dispose list. Both A
and B will have found NUM dentries (data.found == NUM).

Now comes the interesting part: A will proceed to shrink the dispose
list by killing individual dentries and decrementing the refcount of
the parent (which is DIR). NB: decrementing DIR's refcount will block
if DIR's d_lock is held. B will shrink a zero-size list and then
immediately restart scanning the directory, where it will lock DIR's
d_lock, scan the remaining dentries and find no dentry to dispose.

So that results in B doing the directory scan over and over again,
holding d_lock of DIR, while A is waiting for a chance to decrement
the refcount of DIR and making very slow progress because of this. B
is wasting time and holding up the progress of A at the same time.

The proposed fix is to check for this situation in B (found some
dentries, but all are being shrunk already) and sleep for some time
before retrying the scan. The sleep is proportional to the number of
found dentries.

Test script:

--- 8< --- 8< --- 8< --- 8< --- 8< ---
#!/bin/bash

TESTROOT=/var/tmp/test-root
SUBDIR=$TESTROOT/sub/dir

prepare()
{
	rm -rf $TESTROOT
	mkdir -p $SUBDIR
	for (( i = 0; i < 1000; i++ )); do
		for (( j = 0; j < 1000; j++ )); do
			if test -e $SUBDIR/$i.$j; then
				echo "This should not happen!"
				exit 1
			fi
		done
		printf "%i (%s) ...\r" $((($i + 1) * $j)) `grep dentry /proc/slabinfo | sed -e "s/dentry *\([0-9]*\).*/\1/"`
	done
}

prepare
printf "\nStarting shrinking\n"
time rmdir $TESTROOT 2> /dev/null

prepare
printf "\nStarting parallel shrinking\n"
time (rmdir $SUBDIR & rmdir $TESTROOT 2> /dev/null & wait)
--- 8< --- 8< --- 8< --- 8< --- 8< ---

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/dcache.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
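Note that the test script never actually creates the million files it probes: each failed `test -e` lookup leaves a negative dentry behind, which is what inflates the directory's dcache footprint. A minimal standalone demonstration of that mechanism (illustrative; the path prefix and loop count are arbitrary, and reading /proc/slabinfo typically requires root):

--- 8< ---
/* Demonstrates negative-dentry creation by failed lookups, the
 * mechanism the test script above relies on. Illustrative only.
 * Compile: cc -std=c99 -o negdent negdent.c */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return the active object count of the "dentry" slab, or -1. */
static long dentry_count(void)
{
	char line[256];
	long n = -1;
	FILE *f = fopen("/proc/slabinfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "dentry ", 7) == 0) {
			sscanf(line, "dentry %ld", &n);
			break;
		}
	}
	fclose(f);
	return n;
}

int main(void)
{
	char path[64];
	long before = dentry_count();

	for (int i = 0; i < 100000; i++) {
		snprintf(path, sizeof(path), "/var/tmp/no-such-file-%d", i);
		access(path, F_OK);	/* fails; leaves a negative dentry */
	}
	printf("dentry slab objects: %ld -> %ld\n", before, dentry_count());
	return 0;
}
--- 8< ---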