[5/5] fs: don't set *REFERENCED unless we are on the lru list
diff mbox

Message ID 1477420904-1399-6-git-send-email-jbacik@fb.com
State Not Applicable
Headers show

Commit Message

Josef Bacik Oct. 25, 2016, 6:41 p.m. UTC
With anything that populates the inode/dentry cache with a lot of one time use
inodes we can really put a lot of pressure on the system for things we don't
need to keep in cache.  It takes two runs through the LRU to evict these one use
entries, and if you have a lot of memory you can end up with 10's of millions of
entries in the dcache or icache that have never actually been touched since they
were first instantiated, and it will take a lot of CPU and a lot of pressure to
evict all of them.

So instead do what we do with pagecache, only set the *REFERENCED flags if we
are being used after we've been put onto the LRU.  This makes a significant
difference in the system's ability to evict these useless cache entries.  With a
fs_mark workload that creates 40 million files we get the following results (all
in files/sec)

Btrfs			Patched		Unpatched
Average Files/sec:	72209.3		63254.2
p50 Files/sec:		70850		57560
p90 Files/sec:		68757		53085
p99 Files/sec:		68757		53085

XFS			Patched		Unpatched
Average Files/sec:	61025.5		60719.5
p50 Files/sec:		60107		59465
p90 Files/sec: 		59300		57966
p99 Files/sec: 		59227		57528

Ext4			Patched		Unpatched
Average Files/sec:	121785.4	119248.0
p50 Files/sec:		120852		119274
p90 Files/sec:		116703		112783
p99 Files/sec:		114393		104934

The reason Btrfs has a much larger improvement is because it holds a lot more
things in memory so benefits more from faster slab reclaim, but across the board
is an improvement for each of the file systems.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/dcache.c | 15 ++++++++++-----
 fs/inode.c  |  5 ++++-
 2 files changed, 14 insertions(+), 6 deletions(-)

Comments

Dave Chinner Oct. 25, 2016, 10:01 p.m. UTC | #1
On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> With anything that populates the inode/dentry cache with a lot of one time use
> inodes we can really put a lot of pressure on the system for things we don't
> need to keep in cache.  It takes two runs through the LRU to evict these one use
> entries, and if you have a lot of memory you can end up with 10's of millions of
> entries in the dcache or icache that have never actually been touched since they
> were first instantiated, and it will take a lot of CPU and a lot of pressure to
> evict all of them.
> 
> So instead do what we do with pagecache, only set the *REFERENCED flags if we
> are being used after we've been put onto the LRU.  This makes a significant
> difference in the system's ability to evict these useless cache entries.  With a
> fs_mark workload that creates 40 million files we get the following results (all
> in files/sec)

What's the workload, storage, etc?

> Btrfs			Patched		Unpatched
> Average Files/sec:	72209.3		63254.2
> p50 Files/sec:	70850		57560
> p90 Files/sec:	68757		53085
> p99 Files/sec:	68757		53085

So how much of this is from changing the dentry referenced
behaviour, and how much from the inode? Can you separate out the two
changes so we know which one is actually affecting reclaim
performance?

Indeed, I wonder if just changing the superblock shrinker
default_seeks for btrfs would have exactly the same impact because
that canbe used to exactly double the reclaim scan rate for the same
memory pressure.  If that doesn't change performance by a similar
amount (changing defaults seeks is the normal way of changing
shrinker balance), then more digging is required here to explain why
the referenced bits make such an impact to steady state
performance...

> XFS			Patched		Unpatched
> Average Files/sec:	61025.5		60719.5
> p50 Files/sec:	60107		59465
> p90 Files/sec:	59300		57966
> p99 Files/sec:	59227		57528

You made XFS never use I_REFERENCED at all (hint: not all
filesystems use find_inode/find_inode_fast()), so it's not clear
that the extra scanning (less than 1% difference in average
throughput) is actuallly the cause of things being slower in btrfs.

> The reason Btrfs has a much larger improvement is because it holds a lot more
> things in memory so benefits more from faster slab reclaim, but across the board
> is an improvement for each of the file systems.

Less than 1% for XFS and ~1.5% for ext4 is well within the
run-to-run variation of fsmark. It looks like it might be slightly
faster, but it's not a cut-and-dried win for anything other than
btrfs.

Cheers,

Dave.
Dave Chinner Oct. 25, 2016, 11:36 p.m. UTC | #2
On Wed, Oct 26, 2016 at 09:01:13AM +1100, Dave Chinner wrote:
> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> > With anything that populates the inode/dentry cache with a lot of one time use
> > inodes we can really put a lot of pressure on the system for things we don't
> > need to keep in cache.  It takes two runs through the LRU to evict these one use
> > entries, and if you have a lot of memory you can end up with 10's of millions of
> > entries in the dcache or icache that have never actually been touched since they
> > were first instantiated, and it will take a lot of CPU and a lot of pressure to
> > evict all of them.
> > 
> > So instead do what we do with pagecache, only set the *REFERENCED flags if we
> > are being used after we've been put onto the LRU.  This makes a significant
> > difference in the system's ability to evict these useless cache entries.  With a
> > fs_mark workload that creates 40 million files we get the following results (all
> > in files/sec)
> 
> What's the workload, storage, etc?
> 
> > Btrfs			Patched		Unpatched
> > Average Files/sec:	72209.3		63254.2
> > p50 Files/sec:	70850		57560
> > p90 Files/sec:	68757		53085
> > p99 Files/sec:	68757		53085
> 
> So how much of this is from changing the dentry referenced
> behaviour, and how much from the inode? Can you separate out the two
> changes so we know which one is actually affecting reclaim
> performance?

FWIW, I just went to run my usual zero-length file creation fsmark
test (16-way create, large sparse FS image on SSDs). XFS (with debug
enabled) takes 4m10s to run at an average of ~230k files/s, with a
std deviation of +/-1.7k files/s.

Unfortunately, btrfs turns that into more heat than it ever has done
before. It's only managing 35k files/s and the profile looks like
this:

  58.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   5.61%  [kernel]  [k] queued_write_lock_slowpath
   1.65%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   1.38%  [kernel]  [k] reschedule_interrupt
   1.08%  [kernel]  [k] _raw_spin_lock
   0.92%  [kernel]  [k] __radix_tree_lookup
   0.86%  [kernel]  [k] _raw_spin_lock_irqsave
   0.83%  [kernel]  [k] btrfs_set_lock_blocking_rw

I killed it because this would take too long to run.

I reduced the concurrency down to 4-way, spinlock contention went
down to about 40% of the CPU time.  I reduced the concurrency down
to 2 and saw about 16% of CPU time being spent in lock contention.

Throughput results:
				btrfs throughput
			2-way			4-way
unpatched	46938.1+/-2.8e+03	40273.4+/-3.1e+03
patched		45697.2+/-2.4e+03	49287.1+/-3e+03


So, 2-way has not improved. If changing referenced behaviour was an
obvious win for btrfs, we'd expect to see that here as well.
however, because 4-way improved by 20%, I think all we're seeing is
a slight change in lock contention levels inside btrfs. Indeed,
looking at the profiles I see that lock contention time was reduced
to around 32% of the total CPU used (down by about 20%):

  20.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   3.85%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   3.68%  [kernel]  [k] _raw_spin_lock
   3.40%  [kernel]  [k] queued_write_lock_slowpath
   .....

IOWs, the performance increase comes from the fact that changing
inode cache eviction patterns causes slightly less lock contention
in btrfs inode reclaim. IOWs, the problem that needs fixing is the
btrfs lock contention, not the VFS cache LRU algorithms.

Root cause analysis needs to be done properly before behavioural
changes like this are proposed, people!

-Dave.

PS: I caught this profile on unmount when the 8 million inodes
cached inodes were being reclaimed. evict_inodes() ignores the
referenced bit, so this gives a lot of insight into the work being
done by inode reclaim in a filesystem:

  18.54%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   9.43%  [kernel]  [k] rb_erase
   8.03%  [kernel]  [k] __btrfs_release_delayed_node
   7.23%  [kernel]  [k] _raw_spin_lock
   6.93%  [kernel]  [k] __list_del_entry
   4.35%  [kernel]  [k] __slab_free
   3.93%  [kernel]  [k] __mutex_lock_slowpath
   2.77%  [kernel]  [k] bit_waitqueue
   2.58%  [kernel]  [k] kmem_cache_alloc
   2.50%  [kernel]  [k] __radix_tree_lookup
   2.44%  [kernel]  [k] _raw_spin_lock_irq
   2.18%  [kernel]  [k] kmem_cache_free
   2.17%  [kernel]  [k] evict			<<<<<<<<<<<
   2.13%  [kernel]  [k] fsnotify_destroy_marks
   1.68%  [kernel]  [k] btrfs_remove_delayed_node
   1.61%  [kernel]  [k] __call_rcu.constprop.70
   1.50%  [kernel]  [k] __remove_inode_hash
   1.49%  [kernel]  [k] kmem_cache_alloc_trace
   1.39%  [kernel]  [k] ___might_sleep
   1.15%  [kernel]  [k] __memset
   1.12%  [kernel]  [k] __mutex_unlock_slowpath
   1.03%  [kernel]  [k] evict_inodes		<<<<<<<<<<
   1.02%  [kernel]  [k] cmpxchg_double_slab.isra.66
   0.93%  [kernel]  [k] free_extent_map
   0.83%  [kernel]  [k] _raw_write_lock
   0.69%  [kernel]  [k] __might_sleep
   0.56%  [kernel]  [k] _raw_spin_unlock

The VFS inode cache traversal to evict inodes is a very small part
of the CPU usage being recorded. btrfs lock traffic alone accounts
for more than 10x as much CPU usage as inode cache traversal....
Josef Bacik Oct. 26, 2016, 3:11 p.m. UTC | #3
On 10/25/2016 06:01 PM, Dave Chinner wrote:
> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
>> With anything that populates the inode/dentry cache with a lot of one time use
>> inodes we can really put a lot of pressure on the system for things we don't
>> need to keep in cache.  It takes two runs through the LRU to evict these one use
>> entries, and if you have a lot of memory you can end up with 10's of millions of
>> entries in the dcache or icache that have never actually been touched since they
>> were first instantiated, and it will take a lot of CPU and a lot of pressure to
>> evict all of them.
>>
>> So instead do what we do with pagecache, only set the *REFERENCED flags if we
>> are being used after we've been put onto the LRU.  This makes a significant
>> difference in the system's ability to evict these useless cache entries.  With a
>> fs_mark workload that creates 40 million files we get the following results (all
>> in files/sec)
>
> What's the workload, storage, etc?

Oops sorry I thought I said it.  It's fs_mark creating 20 million empty files on 
a single NVME drive.

>
>> Btrfs			Patched		Unpatched
>> Average Files/sec:	72209.3		63254.2
>> p50 Files/sec:	70850		57560
>> p90 Files/sec:	68757		53085
>> p99 Files/sec:	68757		53085
>
> So how much of this is from changing the dentry referenced
> behaviour, and how much from the inode? Can you separate out the two
> changes so we know which one is actually affecting reclaim
> performance?
>
> Indeed, I wonder if just changing the superblock shrinker
> default_seeks for btrfs would have exactly the same impact because
> that canbe used to exactly double the reclaim scan rate for the same
> memory pressure.  If that doesn't change performance by a similar
> amount (changing defaults seeks is the normal way of changing
> shrinker balance), then more digging is required here to explain why
> the referenced bits make such an impact to steady state
> performance...
>

I'll tease out the impact of changing dcache vs icache vs both.  Yeah I'll 
reduce default_seeks and see what that turns out to be.

>> XFS			Patched		Unpatched
>> Average Files/sec:	61025.5		60719.5
>> p50 Files/sec:	60107		59465
>> p90 Files/sec:	59300		57966
>> p99 Files/sec:	59227		57528
>
> You made XFS never use I_REFERENCED at all (hint: not all
> filesystems use find_inode/find_inode_fast()), so it's not clear
> that the extra scanning (less than 1% difference in average
> throughput) is actuallly the cause of things being slower in btrfs.
>
>> The reason Btrfs has a much larger improvement is because it holds a lot more
>> things in memory so benefits more from faster slab reclaim, but across the board
>> is an improvement for each of the file systems.
>
> Less than 1% for XFS and ~1.5% for ext4 is well within the
> run-to-run variation of fsmark. It looks like it might be slightly
> faster, but it's not a cut-and-dried win for anything other than
> btrfs.
>

Sure the win in this benchmark is clearly benefiting btrfs the most, but I think 
the overall approach is sound and likely to help everybody in theory.  Inside FB 
we definitely have had problems where the memory pressure induced by some 
idi^H^H^Hprocess goes along and runs find / which causes us to evict real things 
that are being used rather than these one use inodes.  This sort of behavior 
could possibly be mitigated by this patch, but I haven't sat down to figure out 
a reliable way to mirror this workload to test that theory.  Thanks

Josef

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Josef Bacik Oct. 26, 2016, 8:03 p.m. UTC | #4
On 10/25/2016 07:36 PM, Dave Chinner wrote:
> On Wed, Oct 26, 2016 at 09:01:13AM +1100, Dave Chinner wrote:
>> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
>>> With anything that populates the inode/dentry cache with a lot of one time use
>>> inodes we can really put a lot of pressure on the system for things we don't
>>> need to keep in cache.  It takes two runs through the LRU to evict these one use
>>> entries, and if you have a lot of memory you can end up with 10's of millions of
>>> entries in the dcache or icache that have never actually been touched since they
>>> were first instantiated, and it will take a lot of CPU and a lot of pressure to
>>> evict all of them.
>>>
>>> So instead do what we do with pagecache, only set the *REFERENCED flags if we
>>> are being used after we've been put onto the LRU.  This makes a significant
>>> difference in the system's ability to evict these useless cache entries.  With a
>>> fs_mark workload that creates 40 million files we get the following results (all
>>> in files/sec)
>>
>> What's the workload, storage, etc?
>>
>>> Btrfs			Patched		Unpatched
>>> Average Files/sec:	72209.3		63254.2
>>> p50 Files/sec:	70850		57560
>>> p90 Files/sec:	68757		53085
>>> p99 Files/sec:	68757		53085
>>
>> So how much of this is from changing the dentry referenced
>> behaviour, and how much from the inode? Can you separate out the two
>> changes so we know which one is actually affecting reclaim
>> performance?

(Ok figured out the problem, apparently outlook thinks you are spam and has been 
filtering all your emails into its junk folder without giving them to me.  I've 
fixed this so now we should be able to get somewhere)

>
> FWIW, I just went to run my usual zero-length file creation fsmark
> test (16-way create, large sparse FS image on SSDs). XFS (with debug
> enabled) takes 4m10s to run at an average of ~230k files/s, with a
> std deviation of +/-1.7k files/s.
>
> Unfortunately, btrfs turns that into more heat than it ever has done
> before. It's only managing 35k files/s and the profile looks like
> this:
>
>   58.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    5.61%  [kernel]  [k] queued_write_lock_slowpath
>    1.65%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>    1.38%  [kernel]  [k] reschedule_interrupt
>    1.08%  [kernel]  [k] _raw_spin_lock
>    0.92%  [kernel]  [k] __radix_tree_lookup
>    0.86%  [kernel]  [k] _raw_spin_lock_irqsave
>    0.83%  [kernel]  [k] btrfs_set_lock_blocking_rw
>
> I killed it because this would take too long to run.
>
> I reduced the concurrency down to 4-way, spinlock contention went
> down to about 40% of the CPU time.  I reduced the concurrency down
> to 2 and saw about 16% of CPU time being spent in lock contention.
>
> Throughput results:
> 				btrfs throughput
> 			2-way			4-way
> unpatched	46938.1+/-2.8e+03	40273.4+/-3.1e+03
> patched		45697.2+/-2.4e+03	49287.1+/-3e+03
>
>
> So, 2-way has not improved. If changing referenced behaviour was an
> obvious win for btrfs, we'd expect to see that here as well.
> however, because 4-way improved by 20%, I think all we're seeing is
> a slight change in lock contention levels inside btrfs. Indeed,
> looking at the profiles I see that lock contention time was reduced
> to around 32% of the total CPU used (down by about 20%):
>
>   20.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
>    3.85%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
>    3.68%  [kernel]  [k] _raw_spin_lock
>    3.40%  [kernel]  [k] queued_write_lock_slowpath
>    .....
>
> IOWs, the performance increase comes from the fact that changing
> inode cache eviction patterns causes slightly less lock contention
> in btrfs inode reclaim. IOWs, the problem that needs fixing is the
> btrfs lock contention, not the VFS cache LRU algorithms.
>
> Root cause analysis needs to be done properly before behavioural
> changes like this are proposed, people!
>

We are doing different things.  Yes when you do it all into the same subvol the 
lock contention is obviously much worse and the bottleneck, but that's not what 
I'm doing, I'm creating a subvol for each thread to reduce the lock contention 
on the root nodes.  If you retest by doing that then you will see the 
differences I was seeing.  Are you suggesting that I just made these numbers up? 
  Instead of assuming I'm an idiot and don't know how to root cause issues 
please instead ask what is different from your run versus my run.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Oct. 26, 2016, 10:20 p.m. UTC | #5
On Wed, Oct 26, 2016 at 04:03:54PM -0400, Josef Bacik wrote:
> On 10/25/2016 07:36 PM, Dave Chinner wrote:
> >So, 2-way has not improved. If changing referenced behaviour was an
> >obvious win for btrfs, we'd expect to see that here as well.
> >however, because 4-way improved by 20%, I think all we're seeing is
> >a slight change in lock contention levels inside btrfs. Indeed,
> >looking at the profiles I see that lock contention time was reduced
> >to around 32% of the total CPU used (down by about 20%):
> >
> >  20.79%  [kernel]  [k] __pv_queued_spin_lock_slowpath
> >   3.85%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
> >   3.68%  [kernel]  [k] _raw_spin_lock
> >   3.40%  [kernel]  [k] queued_write_lock_slowpath
> >   .....
> >
> >IOWs, the performance increase comes from the fact that changing
> >inode cache eviction patterns causes slightly less lock contention
> >in btrfs inode reclaim. IOWs, the problem that needs fixing is the
> >btrfs lock contention, not the VFS cache LRU algorithms.
> >
> >Root cause analysis needs to be done properly before behavioural
> >changes like this are proposed, people!
> >
> 
> We are doing different things.

Well, no, we're doing the same thing. Except...

> Yes when you do it all into the same
> subvol the lock contention is obviously much worse and the
> bottleneck, but that's not what I'm doing, I'm creating a subvol for
> each thread to reduce the lock contention on the root nodes.

.. you have a non-default filesystem config that you didn't mention
even in response to a /direct question/. Seriously, this is a major
configuration detail that is necessary for anyone who wants to
replicate your results!

The difference is that I'm using is the /default/ btrfs filesystem
configuration and you're using a carefully contrived structure that
is designed to work around fundamental scalability issues the
benchmark normally exposes.

This raises another question: Does this subvol setup reflect
production configurations, or is it simply a means to get the
benchmark to put enough load on the inode cache to demonstrate the
effect of changing the reference bits?

> If you
> retest by doing that then you will see the differences I was seeing.
> Are you suggesting that I just made these numbers up? 

No, I'm suggesting that you haven't explained the problem or how to
reproduce it in sufficient detail.

You haven't explained why reducing scanning (which in and of iteself
has minimal overhead and is not visible to btrfs) has such a marked
effect on the behaviour of btrfs. You haven't given us details on
the workload or storage config so we can reproduce the numbers (and
still haven't!). You're arguing that numbers far below variablity
and accuracy measurement limits are significant (i.e. the patched
numbers are better for XFS and ext4). There's lots of information
you've chosen to omit that we need to know.

To fill in the gaps, I've done some analysis, provided perf numbers
and CPU profiles, and put them into an explanation that explains why
there is a difference in performance for a default btrfs config. I
note that you've not actually addressed that analysis and the
problems it uncovers, but instead just said "I'm using a different
configuration".

What about all the btrfs users that aren't using your configuration
that will see these problems? How does the scanning change actually
benefit them more than fixing the locking issues that exist? Memory
reclaim and LRU algorithms affect everyone, not just a benchmark
configuration. We need to know a lot more about the issue at hand
to be able to evaluate whether this is the /right solution for the
problem./

> Instead of
> assuming I'm an idiot and don't know how to root cause issues please
> instead ask what is different from your run versus my run.  Thanks,

I'm assuming you're a compentent engineer, Josef, who knows that
details about benchmarks and performance measurement are extremely
important, and that someone reviewing code needs to understand why
the behaviour change had the impact it did to be able to evaluate it
properly. I'm assuming that you also know that if you understand the
root cause of a problem, you put it in the commit message so that
everyone else knows it, too.

So if you know what the root cause of the btrfs problem that
reducing reclaim scanning avoids, then please explain it.

Cheers,

Dave.
Dave Chinner Oct. 27, 2016, 12:30 a.m. UTC | #6
On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote:
> On 10/25/2016 06:01 PM, Dave Chinner wrote:
> >On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> >>With anything that populates the inode/dentry cache with a lot of one time use
> >>inodes we can really put a lot of pressure on the system for things we don't
> >>need to keep in cache.  It takes two runs through the LRU to evict these one use
> >>entries, and if you have a lot of memory you can end up with 10's of millions of
> >>entries in the dcache or icache that have never actually been touched since they
> >>were first instantiated, and it will take a lot of CPU and a lot of pressure to
> >>evict all of them.
> >>
> >>So instead do what we do with pagecache, only set the *REFERENCED flags if we
> >>are being used after we've been put onto the LRU.  This makes a significant
> >>difference in the system's ability to evict these useless cache entries.  With a
> >>fs_mark workload that creates 40 million files we get the following results (all
> >>in files/sec)
> >
> >What's the workload, storage, etc?
> 
> Oops sorry I thought I said it.  It's fs_mark creating 20 million
> empty files on a single NVME drive.

How big is the drive/filesystem (e.g. has impact on XFS allocation
concurrency)?  And multiple btrfs subvolumes, too, by the sound of
it. How did you set those up?  What about concurrency, directory
sizes, etc?  Can you post the fsmark command line as these details
do actually matter...

Getting the benchmark configuration to reproduce posted results
should not require playing 20 questions!

> >>The reason Btrfs has a much larger improvement is because it holds a lot more
> >>things in memory so benefits more from faster slab reclaim, but across the board
> >>is an improvement for each of the file systems.
> >
> >Less than 1% for XFS and ~1.5% for ext4 is well within the
> >run-to-run variation of fsmark. It looks like it might be slightly
> >faster, but it's not a cut-and-dried win for anything other than
> >btrfs.
> >
> 
> Sure the win in this benchmark is clearly benefiting btrfs the most,
> but I think the overall approach is sound and likely to help
> everybody in theory.

Yup, but without an explanation of why it makes such a big change to
btrfs, we can't really say what effect it's really going to have.
Why does cycling the inode a second time through the LRU make any
difference? Once we're in steady state on this workload, one or two
cycles through the LRU should make no difference at all to
performance because all the inodes are instantiated in identical
states (including the referenced bit) and so scanning treats every
inode identically. i.e. the reclaim rate (i.e. how often
evict_inode() is called) should be exactly the same and the only
difference is the length of time between the inode being put on the
LRU and when it is evicted.

Is there an order difference in reclaim as a result of earlier
reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs
rather than contending on spinning locks? Is it having to wait for
transaction commits or some other metadata IO because reclaim is
happening earlier? i.e. The question I'm asking is what, exactly,
leads to such a marked performance improvement in steady state
behaviour?

I want to know because if there's behavioural changes in LRU reclaim
order having a significant effect on affecting btrfs, then there is
going to be some effects on other filesystems, too. Maybe not in
this benchmark, but we can't anticipate potential problems if we
don't understand exactly what is going on here.

> Inside FB we definitely have had problems
> where the memory pressure induced by some idi^H^H^Hprocess goes
> along and runs find / which causes us to evict real things that are
> being used rather than these one use inodes.

That's one of the problems the IO throttling in the XFS shrinker
tends to avoid. i.e. This is one of the specific cases we expect to see
on all production systems - backup applications are the common cause
of regular full filesystem traversals.

FWIW, there's an element of deja vu in this thread: that XFS inode
cache shrinker IO throttling is exactly what Chris proposed we gut
last week to solve some other FB memory reclaim problem that had no
explanation of the root cause.

(http://www.spinics.net/lists/linux-xfs/msg01541.html)

> This sort of behavior
> could possibly be mitigated by this patch, but I haven't sat down to
> figure out a reliable way to mirror this workload to test that
> theory.  Thanks

I use fsmark to create filesystems with tens of millions of small
files, then do things like run concurrent greps repeatedly over a
small portion of the directory structure (e.g. 1M cached inodes
and 4GB worth of cached data) and then run concurrent find
traversals across the entire filesystem to simulate this sort
use-once vs working set reclaim...

Cheers,

Dave.
Josef Bacik Oct. 27, 2016, 1:13 p.m. UTC | #7
On 10/26/2016 08:30 PM, Dave Chinner wrote:
> On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote:
>> On 10/25/2016 06:01 PM, Dave Chinner wrote:
>>> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
>>>> With anything that populates the inode/dentry cache with a lot of one time use
>>>> inodes we can really put a lot of pressure on the system for things we don't
>>>> need to keep in cache.  It takes two runs through the LRU to evict these one use
>>>> entries, and if you have a lot of memory you can end up with 10's of millions of
>>>> entries in the dcache or icache that have never actually been touched since they
>>>> were first instantiated, and it will take a lot of CPU and a lot of pressure to
>>>> evict all of them.
>>>>
>>>> So instead do what we do with pagecache, only set the *REFERENCED flags if we
>>>> are being used after we've been put onto the LRU.  This makes a significant
>>>> difference in the system's ability to evict these useless cache entries.  With a
>>>> fs_mark workload that creates 40 million files we get the following results (all
>>>> in files/sec)
>>>
>>> What's the workload, storage, etc?
>>
>> Oops sorry I thought I said it.  It's fs_mark creating 20 million
>> empty files on a single NVME drive.
>
> How big is the drive/filesystem (e.g. has impact on XFS allocation
> concurrency)?  And multiple btrfs subvolumes, too, by the sound of
> it. How did you set those up?  What about concurrency, directory
> sizes, etc?  Can you post the fsmark command line as these details
> do actually matter...
>
> Getting the benchmark configuration to reproduce posted results
> should not require playing 20 questions!

This is the disk

Disk /dev/nvme0n1: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors

This is the script

https://paste.fedoraproject.org/461874/73910147/1

It's on a 1 socket 8 core cpu with 16gib of ram.

>
>>>> The reason Btrfs has a much larger improvement is because it holds a lot more
>>>> things in memory so benefits more from faster slab reclaim, but across the board
>>>> is an improvement for each of the file systems.
>>>
>>> Less than 1% for XFS and ~1.5% for ext4 is well within the
>>> run-to-run variation of fsmark. It looks like it might be slightly
>>> faster, but it's not a cut-and-dried win for anything other than
>>> btrfs.
>>>
>>
>> Sure the win in this benchmark is clearly benefiting btrfs the most,
>> but I think the overall approach is sound and likely to help
>> everybody in theory.
>
> Yup, but without an explanation of why it makes such a big change to
> btrfs, we can't really say what effect it's really going to have.
> Why does cycling the inode a second time through the LRU make any
> difference? Once we're in steady state on this workload, one or two
> cycles through the LRU should make no difference at all to
> performance because all the inodes are instantiated in identical
> states (including the referenced bit) and so scanning treats every
> inode identically. i.e. the reclaim rate (i.e. how often
> evict_inode() is called) should be exactly the same and the only
> difference is the length of time between the inode being put on the
> LRU and when it is evicted.
>
> Is there an order difference in reclaim as a result of earlier
> reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs
> rather than contending on spinning locks? Is it having to wait for
> transaction commits or some other metadata IO because reclaim is
> happening earlier? i.e. The question I'm asking is what, exactly,
> leads to such a marked performance improvement in steady state
> behaviour?

I would have seen this in my traces.  There's tooooons of places to improve 
btrfs's performance and behavior here no doubt.  But simply moving from 
pagecache to a slab shrinker shouldn't have drastically changed how we perform 
in this test.  I feel like the shrinkers need to be used differently, but I 
completely destroyed vmscan.c trying different approaches and none of them made 
a difference like this patch made.  From what I was seeing in my trace we were 
simply reclaiming less per kswapd scan iteration with the old approach vs. the 
new approach.

>
> I want to know because if there's behavioural changes in LRU reclaim
> order having a significant effect on affecting btrfs, then there is
> going to be some effects on other filesystems, too. Maybe not in
> this benchmark, but we can't anticipate potential problems if we
> don't understand exactly what is going on here.

So I'll just describe to you what I was seeing and maybe we can work out where 
we think the problem is.

1) We go at X speed until we fill up all of the memory with the various caches.
2) We lost about 15% when kswapd kicks in and it never recovered.

Doing tracing I was seeing that we would try to scan very small batches of 
objects, usually less than the batch size, because with btrfs not using 
pagecache for anything and the shrinker scanning being based on how much 
pagecache was scanned our scan totals would be very much smaller than the total 
cache size.  I originally thought this was the problem, but short of just 
forcing us to scan the whole cache every time nothing I did made any difference.

Once we'd scanned through the entire LRU at once suddenly we started reclaiming 
almost as many objects as we scanned, as opposed to < 10 items.  This is because 
of the whole referenced thing.  So we'd see a small perf bump when we did manage 
to do this, and then it would drop again once the lru was full of fresh inodes 
again.  So we'd get this saw-toothy sort of reclaim.

This is when I realized the whole REFERENCED thing was probably screwing us, so 
I went to look at what pagecache did to eliminate this problem.  They do two 
things, first start all pages on the inactive list, and second balance between 
scanning the active list vs inactive list based on the size of either list.  I 
rigged up something to do an active/inactive list inside of list_lru which again 
did basically nothing.  So I changed how the REFERENCED was marked to match how 
pagecache did things and viola, my performance was back.

I did all sorts of tracing to try and figure out what exactly it was about the 
reclaim that made everything so much slower, but with that many things going on 
it was hard to decide what was noise and what was actually causing the problem. 
Without this patch I see kswapd stay at higher CPU usage for longer, because it 
has to keep scanning things to satisfy the high watermark and it takes scanning 
multiple million object lists multiple times before it starts to make any 
progress towards reclaim.  This logically makes sense and is not ideal behavior.

>
>> Inside FB we definitely have had problems
>> where the memory pressure induced by some idi^H^H^Hprocess goes
>> along and runs find / which causes us to evict real things that are
>> being used rather than these one use inodes.
>
> That's one of the problems the IO throttling in the XFS shrinker
> tends to avoid. i.e. This is one of the specific cases we expect to see
> on all production systems - backup applications are the common cause
> of regular full filesystem traversals.
>
> FWIW, there's an element of deja vu in this thread: that XFS inode
> cache shrinker IO throttling is exactly what Chris proposed we gut
> last week to solve some other FB memory reclaim problem that had no
> explanation of the root cause.
>
> (https://urldefense.proofpoint.com/v2/url?u=http-3A__www.spinics.net_lists_linux-2Dxfs_msg01541.html&d=DQIBAg&c=5VD0RTtNlTh3ycd41b3MUw&r=sDzg6MvHymKOUgI8SFIm4Q&m=REEqbp9jT-ifN0oN83tK5X3BxEL76jmuULRORyO8Qeg&s=vlsX-YRw_2HCnXFThpOjyumOHu4JFDcSqssSQG1Rt40&e= )
>
>> This sort of behavior
>> could possibly be mitigated by this patch, but I haven't sat down to
>> figure out a reliable way to mirror this workload to test that
>> theory.  Thanks
>
> I use fsmark to create filesystems with tens of millions of small
> files, then do things like run concurrent greps repeatedly over a
> small portion of the directory structure (e.g. 1M cached inodes
> and 4GB worth of cached data) and then run concurrent find
> traversals across the entire filesystem to simulate this sort
> use-once vs working set reclaim...
>

Yeah that sounds reasonable, like I said I just haven't tried to test it as my 
numbers got bigger and I was happy with that.  I'll rig this up and see how it 
performs with my patch to see if there's a significant difference with FS's 
other than btrfs.  Thanks,

Josef

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Oct. 28, 2016, 3:48 a.m. UTC | #8
On Thu, Oct 27, 2016 at 09:13:44AM -0400, Josef Bacik wrote:
> On 10/26/2016 08:30 PM, Dave Chinner wrote:
> >On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote:
> >>On 10/25/2016 06:01 PM, Dave Chinner wrote:
> >>>On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote:
> >>>>With anything that populates the inode/dentry cache with a lot of one time use
> >>>>inodes we can really put a lot of pressure on the system for things we don't
> >>>>need to keep in cache.  It takes two runs through the LRU to evict these one use
> >>>>entries, and if you have a lot of memory you can end up with 10's of millions of
> >>>>entries in the dcache or icache that have never actually been touched since they
> >>>>were first instantiated, and it will take a lot of CPU and a lot of pressure to
> >>>>evict all of them.
> >>>>
> >>>>So instead do what we do with pagecache, only set the *REFERENCED flags if we
> >>>>are being used after we've been put onto the LRU.  This makes a significant
> >>>>difference in the system's ability to evict these useless cache entries.  With a
> >>>>fs_mark workload that creates 40 million files we get the following results (all
> >>>>in files/sec)
> >>>
> >>>What's the workload, storage, etc?
> >>
> >>Oops sorry I thought I said it.  It's fs_mark creating 20 million
> >>empty files on a single NVME drive.
> >
> >How big is the drive/filesystem (e.g. has impact on XFS allocation
> >concurrency)?  And multiple btrfs subvolumes, too, by the sound of
> >it. How did you set those up?  What about concurrency, directory
> >sizes, etc?  Can you post the fsmark command line as these details
> >do actually matter...
> >
> >Getting the benchmark configuration to reproduce posted results
> >should not require playing 20 questions!
> 
> This is the disk
> 
> Disk /dev/nvme0n1: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> 
> This is the script
> 
> https://paste.fedoraproject.org/461874/73910147/1

Link no workee. This does, though:

https://paste.fedoraproject.org/461874/73910147

> 
> It's on a 1 socket 8 core cpu with 16gib of ram.

Thanks!

> >Is there an order difference in reclaim as a result of earlier
> >reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs
> >rather than contending on spinning locks? Is it having to wait for
> >transaction commits or some other metadata IO because reclaim is
> >happening earlier? i.e. The question I'm asking is what, exactly,
> >leads to such a marked performance improvement in steady state
> >behaviour?
> 
> I would have seen this in my traces.  There's tooooons of places to
> improve btrfs's performance and behavior here no doubt.  But simply
> moving from pagecache to a slab shrinker shouldn't have drastically
> changed how we perform in this test.

Not sure what you mean by this. Do you mean that because of the
preceeding patches that change the page cache accounting for btrfs
metadata there is now not enough pressure being placed on the btrfs
inode cache shrinker?

FWIW, this zero length file fsmark workload produces zero page
cache pressure on XFS. If the case is that btrfs is no longer using the pa

> I feel like the shrinkers need
> to be used differently, but I completely destroyed vmscan.c trying
> different approaches and none of them made a difference like this
> patch made.  From what I was seeing in my trace we were simply
> reclaiming less per kswapd scan iteration with the old approach vs.
> the new approach.

Yup, that's what default_seeks is supposed to be used to tune - the
relative pressure that should be placed on the slab compared to
everything else.

> >I want to know because if there's behavioural changes in LRU reclaim
> >order having a significant effect on affecting btrfs, then there is
> >going to be some effects on other filesystems, too. Maybe not in
> >this benchmark, but we can't anticipate potential problems if we
> >don't understand exactly what is going on here.
> 
> So I'll just describe to you what I was seeing and maybe we can work
> out where we think the problem is.
> 
> 1) We go at X speed until we fill up all of the memory with the various caches.
> 2) We lost about 15% when kswapd kicks in and it never recovered.

That's expected behaviour (i.e. that's exactly what I see when XFS
fill memory and reclaim gates new allocations). Memory reclaim does
not come for free.

> Doing tracing I was seeing that we would try to scan very small
> batches of objects, usually less than the batch size, because with
> btrfs not using pagecache for anything and the shrinker scanning
> being based on how much pagecache was scanned our scan totals would
> be very much smaller than the total cache size.  I originally
> thought this was the problem, but short of just forcing us to scan
> the whole cache every time nothing I did made any difference.

So this is from a kernel just running your patch 5 (the referenced
bit changes) and samples were taken for about a minute at steady
state after memory filled (about 20m files into a 16-way 50m file
create with subvols running at ~130k files/s in steady state) using:

# ~/trace-cmd/trace-cmd record -e mm_shrink_slab\*

There were this many shrinker invocations, but all of them came from
kswapd (no direct reclaim at all).

$ grep 0xffff88036a2774c0 t.t |grep shrink_slab_start |wc -l
1923

The amount of reclaim each shrinker invocation did was:

# grep 0xffff88036a2774c0 t.t  |awk -e '/mm_shrink_slab_end/ { print $23 }' |sort -n |uniq -c |sort -nr
    623 1025
    481 0
    172 3075
    145 2050
    131 6150
     82 12300
     54 7175
     40 4100
     40 11275
     34 5125
     32 14350
     30 13325
     19 23575
     18 24600
      7 15375
      4 8200
      3 22550
      3 10250
      2 47150
      2 16400
      1 9225
      1 25625
      1 14349
      1 11274

I don't see any partial batches there, but I also don't see enough
reclaim pressure being put on the inode cache. I'll come back to
this, but first lets look at what how XFS behaves, because it
doesn't create any page cache pressure on these workloads:


$ grep 0xffff880094aea4c0 t.t  |awk -e '/mm_shrink_slab_end/ { print $23 }' |sort -n |uniq -c |sort -nr
   1059 0
      5 1000
      4 7045
      4 1008
      3 7970
      3 7030
      2 996
      2 995
      2 962
      2 9005
      2 8961
      2 8937
      2 8062
      2 8049
.....

Very different look to thework szies. Sorting on the reclaim batch
sizes shows this:

      1 901131
      1 874354
      1 872911
      1 847679
      1 760696
      1 616491
      1 590231
      1 542313
      1 517882
      1 513400
      1 507431
      1 479111
      1 439889
      1 433734
      1 410811
      1 388127
.....

There are massive batches of inodes being reclaimed. Let's have a
look at that one of those traces more closely:

kswapd1-862   [001] 183735.856649: mm_shrink_slab_start:
	super_cache_scan+0x0 0xffff880094aea4c0: nid: 1
	objects to shrink 882936 gfp_flags GFP_KERNEL
	pgs_scanned 32 lru_pgs 6236
	cache items 2416122 delta 24792 total_scan 907728

So here we see only 32 pages were scanned in the page cache, of
a very small 6236 pages - there is no page cache to speak of, and
the 32 pages scanned is the defined minimum for the page cache.

However, the calculated delta is still reasonable at 24792, with the
cache size at 2416122 objects. It's reasonable because the delta is
almost exactly 1% of the cache size and that's a good reclaim batch
size for a single shrinker invocation.

However, total_scan is at 907728, which means there have been lots
of shrinker invocations that have not done work due to GFP_NOFS
context, and so that work has been deferred. kswapd is invoked with
GFP_KERNEL context, so we'd expect it to do most of that work right
now:

kswapd1-862   [009] 183737.582875: mm_shrink_slab_end:
	super_cache_scan+0x0 0xffff880094aea4c0: nid: 1
	unused scan count 882936 new scan count 464 total_scan 464
	last shrinker return val 874354

And it does - it freed 874354 objects from the cache, and the
deferred scan count returned to the shrinker was only 464. I'll also
point out that the reclaim rate here is about 500,000 inodes/s (per
kswapd) so we've got plenty of headroom and scalability here.

This is behaving exactly as we'd expect if there was no page cache
pressure and substantial GFP_NOFS allocations. THe delta calculation
is in the correct ballpark, the reclaim batch size is being
determined by the amount of work we are deferring to kswapd.

Just to show what GFP_NOFS windup from direct reclaim looks like:

fs_mark-30202 [006] 183695.757995: mm_shrink_slab_start:
	super_cache_scan+0x0 0xffff880094aea4c0: nid: 1
	objects to shrink 64064 gfp_flags GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK
	pgs_scanned 32 lru_pgs 6697
	cache items 2236078 delta 21365 total_scan 85429
fs_mark-30202 [006] 183695.757995: mm_shrink_slab_end:
	super_cache_scan+0x0 0xffff880094aea4c0: nid: 1
	unused scan count 64064 new scan count 85429 total_scan 85429
	last shrinker return val 0

incoming deferred work was 64064 objects, delta added was 21365
(again, 1% of total cache size), zero objects were freed, outgoing
deferred work in now 85429 objects.

So we can see that regardless of the invocation context of the
shrinker, XFS is asking for 1% of the cache to be removed. Knowing
this, let's go back to the btrfs traces and examine that:

kswapd1-862   [013] 185883.279076: mm_shrink_slab_start:
	super_cache_scan+0x0 0xffff8803329404c0: nid: 1
	objects to shrink 1006 gfp_flags GFP_KERNEL
	pgs_scanned 203 lru_pgs 251067
	cache items 1787431 delta 2890 total_scan 3896

Ok, the first thing to note here is that the delta is much smaller
than what we see with XFS. That is also no obvious deferred work,
which we'd expect because there is no direct reclaim records in the
trace for btrfs. It's pretty clear from this that each shrinker
invocation is asking for about 0.15% of the cache to be scanned -
there's much lower pressure being put on the shrinker when compared
to XFS. Hence it will take 5-10x as many invocations of the shrinker
to do the same amount of work.

Occasionally we see the page cache get scanned enough to bring
deltas up to 1% of the cache size:

kswapd1-862   [013] 185883.943647: mm_shrink_slab_start:
	super_cache_scan+0x0 0xffff8803329404c0: nid: 1
	objects to shrink 459 gfp_flags GFP_KERNEL
	pgs_scanned 1585 lru_pgs 249194
	cache items 1784345 delta 22698 total_scan 23157

See the difference here? pgs_scanned is much higher than the
trace from than half a second ago, and now the delta reaches that
~1% of cache size number. This is an outlier, though, but what it
indicates is that the problem is with the amount of page cache
scanning being done during reclaim.

There are two ways to deal with this: fix whatever problem is
causing the page cache to be underscanned, or make the
pgs_scanned:shrinker delta ratio larger. The later can be don via
changing the shrinker seek count or modifying
/proc/sys/vm/vfs_cache_pressure to make the cache seem 10x larger
than it really is....

> Once we'd scanned through the entire LRU at once suddenly we started
> reclaiming almost as many objects as we scanned, as opposed to < 10
> items.  This is because of the whole referenced thing.

That looks like a side effect of not generating enough pressure on
the shrinker.

> So we'd see
> a small perf bump when we did manage to do this, and then it would
> drop again once the lru was full of fresh inodes again.  So we'd get
> this saw-toothy sort of reclaim.

Yup - butin normal conditions this workload should result with a
relatively even mix of referenced and unreferenced inodes on the
LRU, so this behaviour should not happen.

Indeed, I just pulled the referenced patch out of the kernel and
reran the XFS tests, and I see regular "delta 24000, freed 13000"
traces. There are much fewer huge reclaim peaks (i.e. less GFP_NOFS
windup) and no two shrinker invocations freed the exact same number
of objects.

$ grep 0xffff88033a638cc0 t.t  |awk -e '/mm_shrink_slab_end/ { print $23 }' |sort -n |uniq -c |sort -nr -k2
1 381458
1 333756
1 90265
1 88664
1 82371
1 82340
1 81849
1 80772
1 79268

i.e. changing the referenced bit behaviour has made the XFS inode
shrinker too aggressive for this workload, so rebalancing would be
necessary...

At the same time, I changed btrfs to has "seeks = 1". This shows
/substantially/ more sustained and larger amounts of pressure being
put on the btrfs inode shrinker than the referenced patch:

$ grep 0xffff88004ed79cc0: b.t  |awk -e '/mm_shrink_slab_end/ { print $23 }' |sort -n |uniq -c |sort -nr -k2
1 59021
1 56271
1 53743
1 52885
1 47992
1 44442
1 41265
1 40718
1 36038
1 34872
....

.... but it still has a really, really long tail of small reclaim
attempts so it's better but still not perfect....

Unfortunately, because default_seeks is only 2, we're going to have
to change this value  and the constants in do_shrinker_slab() to
give us more ranging to increase reclaim pressure in this manner....

> I did all sorts of tracing to try and figure out what exactly it was
> about the reclaim that made everything so much slower, but with that
> many things going on it was hard to decide what was noise and what
> was actually causing the problem. Without this patch I see kswapd
> stay at higher CPU usage for longer, because it has to keep scanning
> things to satisfy the high watermark and it takes scanning multiple
> million object lists multiple times before it starts to make any
> progress towards reclaim.  This logically makes sense and is not
> ideal behavior.

I would expect the elevated kswapd CPU is due to the fact it is
being called so many more times to do the inode cache reclaim work.
It has to do more page cache scanning because the shrinker is not
removing the memory pressure, and so it's a downward spiral of
inefficiency. let's see what happens when we get the shrinker
pressure to reliably scan 1% of the cache on every invocation....

Cheers,

Dave.

Patch
diff mbox

diff --git a/fs/dcache.c b/fs/dcache.c
index 5c7cc95..a558075 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -779,8 +779,6 @@  repeat:
 			goto kill_it;
 	}
 
-	if (!(dentry->d_flags & DCACHE_REFERENCED))
-		dentry->d_flags |= DCACHE_REFERENCED;
 	dentry_lru_add(dentry);
 
 	dentry->d_lockref.count--;
@@ -803,6 +801,13 @@  static inline void __dget_dlock(struct dentry *dentry)
 	dentry->d_lockref.count++;
 }
 
+static inline void __dget_dlock_reference(struct dentry *dentry)
+{
+	if (dentry->d_flags & DCACHE_LRU_LIST)
+		dentry->d_flags |= DCACHE_REFERENCED;
+	dentry->d_lockref.count++;
+}
+
 static inline void __dget(struct dentry *dentry)
 {
 	lockref_get(&dentry->d_lockref);
@@ -875,7 +880,7 @@  again:
 			    (alias->d_flags & DCACHE_DISCONNECTED)) {
 				discon_alias = alias;
 			} else {
-				__dget_dlock(alias);
+				__dget_dlock_reference(alias);
 				spin_unlock(&alias->d_lock);
 				return alias;
 			}
@@ -886,7 +891,7 @@  again:
 		alias = discon_alias;
 		spin_lock(&alias->d_lock);
 		if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
-			__dget_dlock(alias);
+			__dget_dlock_reference(alias);
 			spin_unlock(&alias->d_lock);
 			return alias;
 		}
@@ -2250,7 +2255,7 @@  struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
 		if (!d_same_name(dentry, parent, name))
 			goto next;
 
-		dentry->d_lockref.count++;
+		__dget_dlock_reference(dentry);
 		found = dentry;
 		spin_unlock(&dentry->d_lock);
 		break;
diff --git a/fs/inode.c b/fs/inode.c
index 7e3ef3a..5937d3a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -796,6 +796,8 @@  repeat:
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
+		if (!list_empty(&inode->i_lru))
+			inode->i_state |= I_REFERENCED;
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		return inode;
@@ -823,6 +825,8 @@  repeat:
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
+		if (!list_empty(&inode->i_lru))
+			inode->i_state |= I_REFERENCED;
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		return inode;
@@ -1463,7 +1467,6 @@  static void iput_final(struct inode *inode)
 		drop = generic_drop_inode(inode);
 
 	if (!drop && (sb->s_flags & MS_ACTIVE)) {
-		inode->i_state |= I_REFERENCED;
 		inode_add_lru(inode);
 		spin_unlock(&inode->i_lock);
 		return;