btrfs metadata reclaim behavior/performance characteristics

Hello list, 

As a result of the perf regression discovered in [0], I also took the time to
investigate the behavior of the metadata reclaim algorithm and how it changed
with the introduction of ticketed reservations. My tests were performed on
the following commits:

Pre-ticketed => c83f8effefa4 ("Btrfs: add tracepoint for adding block groups")
Ticketed => kernel version 4.11
Patched => my patch (attached), applied on top of 4.11 

The workload used was a stripped-down version of generic/027 [1]. The filesystem
is created under the same conditions: 256MB total size, mixed mode, only 2MB of
free space, the rest taken up by a large file. The canonical workload is a single
iteration of 8 workers creating 2KB files. However, the measurements below
were taken over 10 iterations of that workload to reduce noise.

First, here is the latency distribution of reserve_metadata_bytes, obtained
with the systemtap script in [2]. Results are in milliseconds (ms):

Pre-ticketed:
value |-------------------------------------------------- count
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  345330
    1 |@@                                                  17221
    2 |                                                     3506
    4 |                                                      212
    8 |@@                                                  17777
   16 |@                                                    7125
   32 |                                                     1974
   64 |                                                      136
  128 |                                                        3
  256 |                                                        0
  512 |    

Ticketed:
value |-------------------------------------------------- count
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  311844
    1 |@                                                   11123
    2 |@                                                   10103
    4 |@                                                    8669
    8 |@                                                   10802
   16 |@@@                                                 20079
   32 |@@                                                  18015
   64 |                                                     1485
  128 |                                                      175
  256 |                                                        0
  512 |                                                        0

Patched:
value |-------------------------------------------------- count
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  321168
    1 |@                                                   10620
    2 |@                                                    9099
    4 |@                                                    6488
    8 |@@                                                  17000
   16 |@                                                   12495
   32 |                                                     3525
   64 |                                                       54
  128 |                                                        0
  256 |                                                        0

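So, if anything, the ticketed rework seems to have had a negative impact on the
latencies, making them a lot more variable: fewer calls finish in under 1ms and
far more land in the 16ms and 32ms buckets. My patch improves things a little,
but the latencies are still not as good as before.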
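
To put a number on the shift in the tail, the bucket counts above can simply be
summed. The following is a minimal Python sketch, not part of the original
measurements. It assumes each row is a power-of-two bucket labelled by its lower
bound in ms; the 8ms cut-off and the name tail_share are picked purely for
illustration. The truncated 512 row of the pre-ticketed run is left out.

```python
# Bucket counts copied from the three histograms above
# (bucket label in ms -> number of reserve_metadata_bytes calls).
HISTOGRAMS = {
    "pre-ticketed": {0: 345330, 1: 17221, 2: 3506, 4: 212, 8: 17777,
                     16: 7125, 32: 1974, 64: 136, 128: 3, 256: 0},
    "ticketed": {0: 311844, 1: 11123, 2: 10103, 4: 8669, 8: 10802,
                 16: 20079, 32: 18015, 64: 1485, 128: 175, 256: 0, 512: 0},
    "patched": {0: 321168, 1: 10620, 2: 9099, 4: 6488, 8: 17000,
                16: 12495, 32: 3525, 64: 54, 128: 0, 256: 0},
}


def tail_share(buckets, threshold_ms=8):
    """Return (calls in buckets >= threshold_ms, total calls, tail / total)."""
    total = sum(buckets.values())
    tail = sum(count for ms, count in buckets.items() if ms >= threshold_ms)
    return tail, total, tail / total


for name, buckets in HISTOGRAMS.items():
    tail, total, share = tail_share(buckets)
    print(f"{name:<13} {tail:>6} of {total} calls in the >=8ms buckets ({share:.2%})")
```

With the counts as posted, roughly 6.9% of the calls fall into the 8ms-and-above
buckets for pre-ticketed, 12.9% for ticketed and 8.7% for patched.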

Latency, however, is not the only important metric. Here is the total number of
files created by the 8 workers in each of the 10 iterations:

                Pre-ticketed    Ticketed        Patched
                9647            9688            9492
                9811            9805            9627
                9838            9656            9569
                9818            9809            9419
                9684            9883            9495
                9755            9861            9696
                9791            9756            9362
                9852            9714            9489
                9880            9787            9657
                9809            9863            9453

average         9788            9782            9525
vs. Ticketed                    100%            97.38%

median          9810            9787            9495
vs. Ticketed                    100%            97.02%

stdev           73              77              108
stdev/average   0.75%           0.79%           1.14%
stdev/median    0.75%           0.79%           1.14%

generic/027
runtime:        300s            800s            225s
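
For anyone who wants to recheck the summary rows, here is a small Python sketch
that recomputes them from the per-iteration counts. It uses the sample standard
deviation (statistics.stdev) and statistics.median, which averages the two
middle values of the ten samples. Both are assumptions about how the table was
produced, so the recomputed figures may differ slightly from the ones above.

```python
import statistics

# Files created per iteration, copied from the table above.
FILES_CREATED = {
    "Pre-ticketed": [9647, 9811, 9838, 9818, 9684, 9755, 9791, 9852, 9880, 9809],
    "Ticketed": [9688, 9805, 9656, 9809, 9883, 9861, 9756, 9714, 9787, 9863],
    "Patched": [9492, 9627, 9569, 9419, 9495, 9696, 9362, 9489, 9657, 9453],
}


def summarize(counts):
    """Return mean, median, sample stdev and stdev/mean for one column."""
    mean = statistics.mean(counts)
    stdev = statistics.stdev(counts)  # sample (n - 1) standard deviation
    return {
        "mean": mean,
        "median": statistics.median(counts),
        "stdev": stdev,
        "stdev_over_mean": stdev / mean,
    }


SUMMARY = {name: summarize(counts) for name, counts in FILES_CREATED.items()}
baseline_mean = SUMMARY["Ticketed"]["mean"]  # the table uses Ticketed as 100%

for name, s in SUMMARY.items():
    print(f"{name:<13} mean={s['mean']:.1f} ({s['mean'] / baseline_mean:.2%} of Ticketed) "
          f"median={s['median']:.1f} stdev={s['stdev']:.1f} "
          f"stdev/mean={s['stdev_over_mean']:.2%}")
```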

Drilling down further into the increased run times, I traced them to the
number of transactions being generated. I've already described this in my
previous email [0], so I'm not going to repeat the numbers here.

Empirically, I discovered that one way to reduce the latency is to limit the
commit cycles to 1 and allow only the states other than the transaction commit
to run during reclaim. This is exactly the strategy the attached patch
implements. As the numbers above show, with the reduced number of commit cycles
the file counts are only marginally worse than with the current version, yet
this particular test runs a lot faster and latency is lower as well. I also
tested doing only a single pass through all flush_space states, as the
pre-ticketed version did; however, the results were a lot noisier and worse.

Admittedly, this latency problem likely manifests only when we are close to an
ENOSPC condition, which is probably why no one has observed it so far. However,
I think a roughly 3x drop in performance (800s vs. 300s for generic/027), even
in an edge case, with no considerable change in the number of successful writes
during ENOSPC, is a bit too much.

I would like this email to spark a discussion, since we might not have the
optimal solution at the moment.

[0] https://www.spinics.net/lists/linux-btrfs/msg65355.html
[1] http://sprunge.us/HVMF
[2] http://paste.ubuntu.com/24591513/
