
bluestore bug#15724: bluefs _replay file with link count 0:

Message ID D3B38AFD.BB191%krehm@cray.com (mailing list archive)
State New, archived

Commit Message

Kevan Rehm July 19, 2016, 2:09 p.m. UTC
Varada,

I added the following assert in _truncate() but haven't yet had it fire in
our test cluster.  If you or others also want to give it a try, that
would be great.  It seems like a useful assert.

diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc
index dbfd59b..759d1af 100644
--- a/src/os/bluestore/BlueFS.cc
+++ b/src/os/bluestore/BlueFS.cc
@@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t offset)
   dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
            << " file " << h->file->fnode << dendl;

+  // If ino is not in filemap, we are truncating a file that is not allocated.
+
+  auto p = file_map.find(h->file->fnode.ino);
+  if (p == file_map.end()) {
+    assert(0 == "_truncate(): truncating a file that is not allocated");
+  }
+
   // truncate off unflushed data?
   if (h->pos < offset &&
       h->pos + h->buffer.length() > offset) {


The test case which seemed to make the problem occur more frequently for
us consists of mounting a cephfs filesystem, then starting up 32 instances
of ior on each of 4 client machines.  The ior command looks like this:

ior -a POSIX -o  /mnt/cephfs/test -b 8m -i 10


In the meantime I have been looking further at the code, and I think I see
the problem-causing sequence of events.

BlueFS implements unlink() differently than the POSIX model.  In POSIX,
the inode and disk extents for a file are not released until both the link
count reaches zero and the inode reference count (file open count) reaches
zero.  BlueFS however releases the inode and disk extents when the link
count (file->refs) reaches zero even if there are threads that still have
the file open.  

Routine _drop_link() does the inode and disk extent releases, and it sets
a File variable called 'deleted' to true, but note that it does not
otherwise clear the released inode number or disk extent fields in the
File.  It is therefore the responsibility of every thread with a FileRef
for that file to first obtain a lock (to prevent races with _drop_link())
and then look at the value of 'deleted', and only continue using the File
if 'deleted' is false.
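
To spell out the pattern I mean, a FileRef holder would need to do roughly
the following before touching the fnode (sketch only, not actual BlueFS
source; I am assuming the usual single BlueFS mutex here):

  std::lock_guard<std::mutex> l(lock);   // serialize against _drop_link()
  if (file->deleted) {
    return 0;   // inode and disk extents are already released
  }
  // only now is it safe to read or log file->fnode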

Function BlueRocksWritableFile::Close() does not do this.  The routine
does not grab a lock; it calls GetPreallocationStatus(), and if
last_allocated_block is > 0, it calls fs->truncate() (which does get a
lock) which calls _truncate().  Routine _truncate() lowers the file size,
and calls op_file_update(), which posts an update transaction entry to the
log for a file that is already unlinked.   This is how we end up with an
inode in the transaction log that has disk extents but its link count is
zero.
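
Paraphrasing from memory rather than quoting the tree, the path looks
roughly like this (sketch; exact arguments may differ):

  rocksdb::Status BlueRocksWritableFile::Close() {
    size_t block_size, last_allocated_block;
    GetPreallocationStatus(&block_size, &last_allocated_block);  // no lock, no 'deleted' check
    if (last_allocated_block > 0) {
      fs->truncate(h, h->pos);   // takes the BlueFS lock, then _truncate()
                                 // logs op_file_update() for the dead inode
    }
    // ... remainder of Close() ...
  }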

Adding a check of the 'deleted' variable after the lock is taken in
_truncate() would work, but it bothers me a bit to see
BlueRocksWritableFile::Close() using the File struct as if it is still
valid.  Folks with more BlueFS experience can decide if that is a problem
or not.   They might also be able to describe situations where a file
could be unlinked while it is still open, that might lead to better test
cases which fire the assert more quickly.

I'm also a bit concerned about the integrity of the BlueFS filesystem,
should the log have to be replayed.  This problem was only seen because
the inode number wasn't reused in the log after it was released.  What if
a subsequent log transaction reallocates the inode?  The "file with link
count 0" assert would not catch it.  I guess that subsequent reallocation
should overlay the bad FileRef->fnode with a new fnode, so maybe it's
harmless?

It would be nice to have my assert() trigger to provide the "smoking gun".
 However, if I can't duplicate the problem in the next day or two, I will
probably just patch _truncate() to return immediately if 'deleted' is
true, and move on. 
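
For concreteness, the workaround would just be an early-out at the top of
_truncate(), something like this (untested sketch):

  int BlueFS::_truncate(FileWriter *h, uint64_t offset)
  {
    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
             << " file " << h->file->fnode << dendl;
    if (h->file->deleted) {
      // already unlinked by _drop_link(); logging an op_file_update() here
      // would recreate a link-count-0 inode in the bluefs transaction log
      return 0;
    }
    // ... rest of _truncate() unchanged ...
  }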

What do you think, should I post this history to bug 15724, create a
separate bug, anything else?

Thanks, Kevan

On 7/16/16, 9:35 AM, "ceph-devel-owner@vger.kernel.org on behalf of Kevan
Rehm" <ceph-devel-owner@vger.kernel.org on behalf of krehm@cray.com> wrote:

>
>
>On 7/16/16, 7:16 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>
>>On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:
>>> Varada, see below.
>>>
>>> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>>>
>>>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:
>>>>> Greetings,
>>>>>
>>>>> I have made a bit of progress.  Still looking for ideas/advise while
>>>>>I
>>>>> keep digging.
>>>>>
>>>>> First, the ceph version I forgot to mention last time.
>>>>>
>>>>> # ceph -v
>>>>> ceph version 11.0.0-196-g85bb43e
>>>>> (85bb43e111692989d2296a389ce45377d2297d6f)
>>>>>
>>>>>
>>>>>
>>>>> As a refresher, here is the beginning of the problem transaction
>>>>>which
>>>>> will result in an inode with 0 links.   There are actually three
>>>>>inodes
>>>>> in
>>>>> this transaction which will all end up having 0 links, inodes 109,
>>>>>102,
>>>>> and 113.
>>>>>
>>>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000:
>>>>> txn(seq 77752 len 35269 crc 1796157826)
>>>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_dir_unlink  db/000154.log
>>>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_remove 109
>>>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_dir_unlink  db/000153.log
>>>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_remove 102
>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_dir_unlink  db/000151.log
>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_remove 113
>>>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_dir_unlink  db/000150.log
>>>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_remove 111
>>>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12
>>>>> 17:05:53.738683 bdev 1 extents
>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>>>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino 113 =
>>>>> 0x7fb66aafc880 (new)
>>>> When _get_file() is called with a inode number, if the file is not
>>>>found
>>>> in file_map
>>>> a new file is created and returned, hence that (new) in the log file.
>>>> These set of events
>>>> can happen in case of rename. You can see the same BlueFS::rename().
>>>>In
>>>> that case, we use the
>>>> same inode number.
>>>
>>> I'm not sure I understand, I do not believe this is a rename case.  For
>>> this to be a rename, there would have to be at least one op_dir_link()
>>> call for each op_dir_unlink() call in the sequence above.   The only
>>>code
>>> sequence I can find where an op_dir_unlink() and an op_file_remove()
>>> appear together without a corresponding op_dir_link() is when a file is
>>> being deleted.
>>>
>>> The (new) above makes my point exactly.  The op_file_remove() call
>>>above
>>> freed the block allocation for the inode and released the inode from
>>>the
>>> filemap, showing that the file->refs had to be zero at that time
>>>(routine
>>> _drop_link).   Yet the following op_file_update() call is performing a
>>> truncate on that same inode, which is not in the filemap.   So the
>>>caller
>>> has to be holding a file structure whose refs field is greater than
>>>zero.
>>> Which implies that the caller's file structure has to be stale.  The
>>>fact
>>> that the op_file_update file contents for inode 113 is exactly the same
>>>as
>>> the previous op_file_update contents prior to the op_file_remove(),
>>>except
>>> for a reduction in file size, proves I believe that the file struct
>>>used
>>> by the op_file_update() caller is stale.
>>>
>>>
>>>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12
>>>>> 17:05:54.615190 bdev 1 extents
>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])
>>>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino 102 =
>>>>> 0x7fb66aafbb90 (new)
>>>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12
>>>>> 22:41:33.028030 bdev 1 extents
>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino 109 =
>>>>> 0x7fb66aafbe60 (new)
>>>>> ...
>>>>>
>>>>>
>>>>> The prior transaction log entry for each of the three inodes was
>>>>>another
>>>>> op_file_update() call.  The only difference between the two updates
>>>>>for
>>>>> ino 113 was file size went from 71359362 -> 71354938, 4424 less
>>>>>bytes,
>>>>> all
>>>>> other fields are identical.
>>>> Are the extents same for both the events?
>>>
>>> Yes, please compare this line with the one higher above in this email,
>>>you
>>> can see that only the file size is different.   That is true for all
>>>three
>>> of these inodes.
>>>
>>>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:
>>>>> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12
>>>>> 17:05:53.738683 bdev 1 extents
>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>>>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino 113 =
>>>>> 0x7fb66aafbb90
>>>>>
>>>>>
>>>>> Same story for inode 109, the file size changed from 71354939 ->
>>>>> 64421913,
>>>>> 6933026 less bytes, all other fields are identical.
>>>>>
>>>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000:
>>>>> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12
>>>>> 22:41:33.028030 bdev 1 extents
>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino 109 =
>>>>> 0x7fb66aafc910
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Same story for inode 102, the file size changed from 71359405 ->
>>>>> 68557043,
>>>>> 2802362 less bytes, all other fields are identical.
>>>>>
>>>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000:
>>>>> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12
>>>>> 17:05:54.615190 bdev 1 extents
>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])
>>>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino 102 =
>>>>> 0x7fb66aafbe60
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> There is only one caller of op_file_update() in which the size of the
>>>>> file
>>>>> is reduced but all other fields stay the same, that is
>>>>> BlueFS::_truncate(), which in turn is only called by
>>>>>BlueFS::truncate(),
>>>>> which is only called by BlueRocksWritableFile::Close().
>>>>>
>>>>> The op_dir_unlink() and op_file_remove() calls for each of the three
>>>>> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm
>>>>>having
>>>>> difficulty finding where the call is coming from.
>>>> Any syscall making dir rename or file remove or directory remove will
>>>> result in BlueFS::rename or unlink
>>>> or corresponding calls and results in op_dir_* or op_file_*
>>>>operations.
>>>> rocksdb wal_manager or compaction
>>>> can trigger these calls during that process. Will take a closer look
>>>>at
>>>> the logs, if i can get something from them.
>>>>> So the problem appears to be that the file is first deleted, then it
>>>>>is
>>>>> closed, and a byproduct of the close operation  on a stale fnode is
>>>>>that
>>>>> another op_file_update() entry gets logged, recreating the deleted
>>>>>inode
>>>>> with a stale fnode, resulting in the 0 links failure.
>>>> Once the file is deleted, all the underlying extents will be cleared.
>>>> Until and unless there is race in handling on
>>>> the files from application level, Bluefs doesn't have any control over
>>>> them. If the a file(handle) is been reused after
>>>> deletion, that seems to be a bug. Will take a look at the logs to
>>>> confirm your theory.
>>>> But here, this seems to be more of rename use case. one more thing to
>>>> verify is the extents owned by the inodes mentioned.
>>>
>>> No, I don't believe this is a rename, see above.
>>>
>>> I'll note that every occurrence of this problem has been preceded by an
>>> OSD segfault.  And this problem is racy, if I increase debug logging
>>>for
>>> bluefs or bluestore, then the problem goes away.  I have 50+ bluestore
>>> segfaults and asserts in my cluster logs; I suspect many could be due
>>>to
>>> this problem, because the same range of blocks is being freed more than
>>> once.   If some other file should perform an allocation between the
>>> block-freeing in op_file_remove() and the second freeing in
>>> op_file_update(), then we could easily have multiple files with the
>>>same
>>> block allocation at the same time.
>>>
>>> I think this problem can be caught fairly easily, an assert could be
>>>added
>>> to _truncate().   The _truncate() routine is only called for files
>>>which
>>> are open, which means that the inode has to be in the filemap.   Code
>>> could be added at the front of _truncate() to look up the inode in the
>>> filemap, and if it is not there, then assert.   That would provide a
>>>stack
>>> trace showing the caller using the file struct with the stale reference
>>> count.   (It may be a few days before I have a chance to try this
>>>myself.)
>>Sure. Could you please let me know the steps to hit issue?
>>i am doing only few min runs to check the performance and some unit test
>>cases.
>>I will add the necessary asserts if i can reproduce the issue and fix it.
> 
>Thanks for the help!   I will have to get back to you on Monday, as there
>is another person on our team who runs the benchmarks that trigger this
>problem, I don't know how to run his benchmarks myself.
>
>Kevan
> 
>>
>>Varada
>>> Regards, Kevan
>>>
>>>
>>>
>>
>>PLEASE NOTE: The information contained in this electronic mail message is
>>intended only for the use of the designated recipient(s) named above. If
>>the reader of this message is not the intended recipient, you are hereby
>>notified that you have received this message in error and that any
>>review, dissemination, distribution, or copying of this message is
>>strictly prohibited. If you have received this communication in error,
>>please notify the sender by telephone or e-mail (as shown above)
>>immediately and destroy any and all copies of this message in your
>>possession (whether hard copies or electronically stored copies).
>
>--
>To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Varada Kari July 19, 2016, 2:50 p.m. UTC | #1
On Tuesday 19 July 2016 07:40 PM, Kevan Rehm wrote:
> Varada,
>
> I added the following assert in _truncate() but haven't yet had it fire in
> our test cluster.   If you or others also want to give it a try, that
> would be great.  It seems like a useful assert.
>
> diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc
> index dbfd59b..759d1af 100644
> --- a/src/os/bluestore/BlueFS.cc
> +++ b/src/os/bluestore/BlueFS.cc
> @@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t
> offset)
>    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
>             << " file " << h->file->fnode << dendl;
>
> +  // If ino is not in filemap, we are truncating a file that is not
> allocated.
> +
> +  auto p = file_map.find(h->file->fnode.ino);
> +  if (p == file_map.end()) {
> +    assert(0 == "_truncate(): truncating a file that is not allocated");
> +  }
> +
>    // truncate off unflushed data?
>    if (h->pos < offset &&
>        h->pos + h->buffer.length() > offset) {
If we are using FileWriter, we will(should) have a reference on the fnode.
I don't think that will be deleted from the file_map. Will wait for your
run.
>
> The test case which seemed to make the problem occur more frequently for
> us consists of mounting a cephfs filesystem, then starting up 32 instances
> of ior on each of 4 client machines.  The ior command looks like this:
>
> ior -a POSIX -o  /mnt/cephfs/test -b 8m -i 10
>
>
> In the meantime I have been looking further at the code, and I think I see
> the problem-causing sequence of events.
>
> BlueFS implements unlink() differently than the POSIX model.  In POSIX,
> the inode and disk extents for a file are not released until both the link
> count reaches zero and the inode reference count (file open count) reaches
> zero.  BlueFS however releases the inode and disk extents when the link
> count (file->refs) reaches zero even if there are threads that still have
> the file open.
open_for_read and open_for_write will take a reference on the file and
we will increment
num_readers and num_writers on the file reference. we have asserts to
catch if we have any
readers/writers in case file deletion.
> Routine _drop_link() does the inode and disk extent releases, and it sets
> a File variable called 'deleted' to true, but note that it does not
> otherwise clear the released inode number or disk extent fields in the
> File.  It is therefore the responsibility of every thread with a FileRef
> for that file to first obtain a lock (to prevent races with _drop_link())
> and then look at the value of 'deleted', and only continue using the File
> if 'deleted' is false.
>
> Function BlueRocksWritableFile::Close() does not do this.  The routine
> does not grab a lock; it calls GetPreallocationStatus(), and if
> last_allocated_block is > 0, it calls fs->truncate() (which does get a
> lock) which calls _truncate().  Routine _truncate() lowers the file size,
> and calls op_file_update(), which posts an update transaction entry to the
> log for a file that is already unlinked.   This is how we end up with an
> inode in the transaction log that has disk extents but its link count is
> zero.
Yes. i also found the same code path, which truncates the file. Trying
to write some
unit test cases to run all these ops in a concurrent way to find the
problem.
> Adding a check of the 'deleted' variable after the lock is taken in
> _truncate() would work, but it bothers me a bit to see
> BlueRocksWritableFile::Close() using the File struct as if it is still
> valid.  Folks with more BlueFS experience can decide if that is a problem
> or not.   They might also be able to describe situations where a file
> could be unlinked while it is still open, that might lead to better test
> cases which fire the assert more quickly.
Added some asserts to find the issue using a test with multiple threads.
Still
trying reproduce the issue. I believe we should have a reference to the
file. Will verify that
anyways.
>
> I'm also a bit concerned about the integrity of the BlueFS filesystem,
> should the log have to be replayed.  This problem was only seen because
> the inode number wasn't reused in the log after it was released.  What if
> a subsequent log transaction reallocates the inode?  The "file with link
> count 0" assert would not catch it.  I guess that subsequent reallocation
> should overlay the bad FileRef->fnode with a new fnode, so maybe it's
> harmless?
Hopefully. Will try to add some debug around reference count of the file
and
verify the theory.

> It would be nice to have my assert() trigger to provide the "smoking gun".
>  However, if I can't duplicate the problem in the next day or two, I will
> probably just patch _truncate() to return immediately if 'deleted' is
> true, and move on.
>
> What do you think, should I post this history to bug 15724, create a
> separate bug, anything else?
Please post your findings to 15724, if we find a different problem we
can file a
different ticket.

Varada
> Thanks, Kevan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kevan Rehm July 19, 2016, 3:08 p.m. UTC | #2
Varada,

> If we are using FileWriter, we will(should) have a reference on the
>fnode.
> I don't think that will be deleted from the file_map. Will wait for your
> run.
>
> open_for_read and open_for_write will take a reference on the file and
>we will increment
> num_readers and num_writers on the file reference. we have asserts to
>catch if we have any
> readers/writers in case file deletion.

I had hoped that, but can't find them in the code, unless they were added
recently.  See routine _drop_link(), when the link count hits zero, it
proceeds immediately with the inode/extent deletion.  It does have an
assert there checking file->num_reading, but (in my copy) no checks for
num_readers and num_writers.  There are also no checks of the 'nref' in
RefCountedObject from which File inherits.    If you do find any of these
elsewhere that provide protection, please let me know, I'd like to
understand this code better.
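
For reference, the sort of protection I was looking for in _drop_link() at
the point where file->refs reaches zero would be something like the
following (hypothetical sketch; I am assuming num_readers and num_writers
are atomic counters like num_reading):

  assert(file->num_reading.load() == 0);   // this assert is there today
  assert(file->num_readers.load() == 0);   // these two are what I could not find
  assert(file->num_writers.load() == 0);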

Thanks for your help digging into this, and trying to reproduce it.  I'll
update the bug shortly.

Kevan


On 7/19/16, 9:50 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>On Tuesday 19 July 2016 07:40 PM, Kevan Rehm wrote:
>> Varada,
>>
>> I added the following assert in _truncate() but haven't yet had it fire
>>in
>> our test cluster.   If you or others also want to give it a try, that
>> would be great.  It seems like a useful assert.
>>
>> diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc
>> index dbfd59b..759d1af 100644
>> --- a/src/os/bluestore/BlueFS.cc
>> +++ b/src/os/bluestore/BlueFS.cc
>> @@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t
>> offset)
>>    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
>>             << " file " << h->file->fnode << dendl;
>>
>> +  // If ino is not in filemap, we are truncating a file that is not
>> allocated.
>> +
>> +  auto p = file_map.find(h->file->fnode.ino);
>> +  if (p == file_map.end()) {
>> +    assert(0 == "_truncate(): truncating a file that is not
>>allocated");
>> +  }
>> +
>>    // truncate off unflushed data?
>>    if (h->pos < offset &&
>>        h->pos + h->buffer.length() > offset) {
>If we are using FileWriter, we will(should) have a reference on the fnode.
>I don't think that will be deleted from the file_map. Will wait for your
>run.
>>
>> The test case which seemed to make the problem occur more frequently for
>> us consists of mounting a cephfs filesystem, then starting up 32
>>instances
>> of ior on each of 4 client machines.  The ior command looks like this:
>>
>> ior -a POSIX -o  /mnt/cephfs/test -b 8m -i 10
>>
>>
>> In the meantime I have been looking further at the code, and I think I
>>see
>> the problem-causing sequence of events.
>>
>> BlueFS implements unlink() differently than the POSIX model.  In POSIX,
>> the inode and disk extents for a file are not released until both the
>>link
>> count reaches zero and the inode reference count (file open count)
>>reaches
>> zero.  BlueFS however releases the inode and disk extents when the link
>> count (file->refs) reaches zero even if there are threads that still
>>have
>> the file open.
>open_for_read and open_for_write will take a reference on the file and
>we will increment
>num_readers and num_writers on the file reference. we have asserts to
>catch if we have any
>readers/writers in case file deletion.
>> Routine _drop_link() does the inode and disk extent releases, and it
>>sets
>> a File variable called 'deleted' to true, but note that it does not
>> otherwise clear the released inode number or disk extent fields in the
>> File.  It is therefore the responsibility of every thread with a FileRef
>> for that file to first obtain a lock (to prevent races with
>>_drop_link())
>> and then look at the value of 'deleted', and only continue using the
>>File
>> if 'deleted' is false.
>>
>> Function BlueRocksWritableFile::Close() does not do this.  The routine
>> does not grab a lock; it calls GetPreallocationStatus(), and if
>> last_allocated_block is > 0, it calls fs->truncate() (which does get a
>> lock) which calls _truncate().  Routine _truncate() lowers the file
>>size,
>> and calls op_file_update(), which posts an update transaction entry to
>>the
>> log for a file that is already unlinked.   This is how we end up with an
>> inode in the transaction log that has disk extents but its link count is
>> zero.
>Yes. i also found the same code path, which truncates the file. Trying
>to write some
>unit test cases to run all these ops in a concurrent way to find the
>problem.
>> Adding a check of the 'deleted' variable after the lock is taken in
>> _truncate() would work, but it bothers me a bit to see
>> BlueRocksWritableFile::Close() using the File struct as if it is still
>> valid.  Folks with more BlueFS experience can decide if that is a
>>problem
>> or not.   They might also be able to describe situations where a file
>> could be unlinked while it is still open, that might lead to better test
>> cases which fire the assert more quickly.
>Added some asserts to find the issue using a test with multiple threads.
>Still
>trying reproduce the issue. I believe we should have a reference to the
>file. Will verify that
>anyways.
>>
>> I'm also a bit concerned about the integrity of the BlueFS filesystem,
>> should the log have to be replayed.  This problem was only seen because
>> the inode number wasn't reused in the log after it was released.  What
>>if
>> a subsequent log transaction reallocates the inode?  The "file with link
>> count 0" assert would not catch it.  I guess that subsequent
>>reallocation
>> should overlay the bad FileRef->fnode with a new fnode, so maybe it's
>> harmless?
>Hopefully. Will try to add some debug around reference count of the file
>and
>verify the theory.
>
>> It would be nice to have my assert() trigger to provide the "smoking
>>gun".
>>  However, if I can't duplicate the problem in the next day or two, I
>>will
>> probably just patch _truncate() to return immediately if 'deleted' is
>> true, and move on.
>>
>> What do you think, should I post this history to bug 15724, create a
>> separate bug, anything else?
>Please post your findings to 15724, if we find a different problem we
>can file a
>different ticket.
>
>Varada
>> Thanks, Kevan

>>

>> On 7/16/16, 9:35 AM, "ceph-devel-owner@vger.kernel.org on behalf of

>>Kevan

>> Rehm" <ceph-devel-owner@vger.kernel.org on behalf of krehm@cray.com>

>>wrote:

>>

>>>

>>> On 7/16/16, 7:16 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>>>

>>>> On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:

>>>>> Varada, see below.

>>>>>

>>>>> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>>>>>

>>>>>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:

>>>>>>> Greetings,

>>>>>>>

>>>>>>> I have made a bit of progress.  Still looking for ideas/advise

>>>>>>>while

>>>>>>> I

>>>>>>> keep digging.

>>>>>>>

>>>>>>> First, the ceph version I forgot to mention last time.

>>>>>>>

>>>>>>> # ceph -v

>>>>>>> ceph version 11.0.0-196-g85bb43e

>>>>>>> (85bb43e111692989d2296a389ce45377d2297d6f)

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>> As a refresher, here is the beginning of the problem transaction

>>>>>>> which

>>>>>>> will result in an inode with 0 links.   There are actually three

>>>>>>> inodes

>>>>>>> in

>>>>>>> this transaction which will all end up having 0 links, inodes 109,

>>>>>>> 102,

>>>>>>> and 113.

>>>>>>>

>>>>>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000:

>>>>>>> txn(seq 77752 len 35269 crc 1796157826)

>>>>>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_dir_unlink  db/000154.log

>>>>>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_file_remove 109

>>>>>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_dir_unlink  db/000153.log

>>>>>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_file_remove 102

>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_dir_unlink  db/000151.log

>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_file_remove 113

>>>>>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_dir_unlink  db/000150.log

>>>>>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_file_remove 111

>>>>>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12

>>>>>>> 17:05:53.738683 bdev 1 extents

>>>>>>>

>>>>>>>

>>>>>>> 

>>>>>>>[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:93113

>>>>>>>54

>>>>>>> 8

>>>>>>> 8+

>>>>>>> 20

>>>>>>>

>>>>>>>

>>>>>>> 

>>>>>>>97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:

>>>>>>>96

>>>>>>> 0

>>>>>>> 49

>>>>>>> 56

>>>>>>> 16+58720256])

>>>>>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>113 =

>>>>>>> 0x7fb66aafc880 (new)

>>>>>> When _get_file() is called with a inode number, if the file is not

>>>>>> found

>>>>>> in file_map

>>>>>> a new file is created and returned, hence that (new) in the log

>>>>>>file.

>>>>>> These set of events

>>>>>> can happen in case of rename. You can see the same BlueFS::rename().

>>>>>> In

>>>>>> that case, we use the

>>>>>> same inode number.

>>>>> I'm not sure I understand, I do not believe this is a rename case.

>>>>>For

>>>>> this to be a rename, there would have to be at least one

>>>>>op_dir_link()

>>>>> call for each op_dir_unlink() call in the sequence above.   The only

>>>>> code

>>>>> sequence I can find where an op_dir_unlink() and an op_file_remove()

>>>>> appear together without a corresponding op_dir_link() is when a file

>>>>>is

>>>>> being deleted.

>>>>>

>>>>> The (new) above makes my point exactly.  The op_file_remove() call

>>>>> above

>>>>> freed the block allocation for the inode and released the inode from

>>>>> the

>>>>> filemap, showing that the file->refs had to be zero at that time

>>>>> (routine

>>>>> _drop_link).   Yet the following op_file_update() call is performing

>>>>>a

>>>>> truncate on that same inode, which is not in the filemap.   So the

>>>>> caller

>>>>> has to be holding a file structure whose refs field is greater than

>>>>> zero.

>>>>> Which implies that the caller's file structure has to be stale.  The

>>>>> fact

>>>>> that the op_file_update file contents for inode 113 is exactly the

>>>>>same

>>>>> as

>>>>> the previous op_file_update contents prior to the op_file_remove(),

>>>>> except

>>>>> for a reduction in file size, proves I believe that the file struct

>>>>> used

>>>>> by the op_file_update() caller is stale.

>>>>>

>>>>>

>>>>>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12

>>>>>>> 17:05:54.615190 bdev 1 extents

>>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])

>>>>>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>102 =

>>>>>>> 0x7fb66aafbb90 (new)

>>>>>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12

>>>>>>> 22:41:33.028030 bdev 1 extents

>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])

>>>>>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>109 =

>>>>>>> 0x7fb66aafbe60 (new)

>>>>>>> ...

>>>>>>>

>>>>>>>

>>>>>>> The prior transaction log entry for each of the three inodes was

>>>>>>> another

>>>>>>> op_file_update() call.  The only difference between the two updates

>>>>>>> for

>>>>>>> ino 133 was file size went from 71359362 -> 71354938, 4424 less

>>>>>>> bytes,

>>>>>>> all

>>>>>>> other fields are identical.

>>>>>> Are the extents same for both the events?

>>>>> Yes, please compare this line with the one higher above in this

>>>>>email,

>>>>> you

>>>>> can see that only the file size is different.   That is true for all

>>>>> three

>>>>> of these inodes.

>>>>>

>>>>>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:

>>>>>>> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12

>>>>>>> 17:05:53.738683 bdev 1 extents

>>>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])

>>>>>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>113 =

>>>>>>> 0x7fb66aafbb90

>>>>>>>

>>>>>>>

>>>>>>> Same story for inode 109, the file size changed from 71354939 ->

>>>>>>> 64421913,

>>>>>>> 6933026 less bytes, all other fields are identical.

>>>>>>>

>>>>>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000:

>>>>>>> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12

>>>>>>> 22:41:33.028030 bdev 1 extents

>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])

>>>>>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>109 =

>>>>>>> 0x7fb66aafc910

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>> Same story for inode 102, the file size changed from 71359405 ->

>>>>>>> 68557043,

>>>>>>> 2802362 less bytes, all other fields are identical.

>>>>>>>

>>>>>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000:

>>>>>>> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12

>>>>>>> 17:05:54.615190 bdev 1 extents

>>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])

>>>>>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>102 =

>>>>>>> 0x7fb66aafbe60

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>> There is only one caller of op_file_update() in which the size of

>>>>>>>the

>>>>>>> file

>>>>>>> is reduced but all other fields stay the same, that is

>>>>>>> BlueFS::_truncate(), which in turn is only called by

>>>>>>> BlueFS::truncate(),

>>>>>>> which is only called by BlueRocksWritableFile::Close().

>>>>>>>

>>>>>>> The op_dir_unlink() and op_file_remove() calls for each of the

>>>>>>>three

>>>>>>> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm

>>>>>>> having

>>>>>>> difficulty finding where the call is coming from.

>>>>>> Any syscall making dir rename or file remove or directory remove

>>>>>>will

>>>>>> result in BlueFS::rename or unlink

>>>>>> or corresponding calls and results in op_dir_* or op_file_*

>>>>>> operations.

>>>>>> rocksdb wal_manager or compaction

>>>>>> can trigger these calls during that process. Will take a closer look

>>>>>> at

>>>>>> the logs, if i can get something from them.

>>>>>>> So the problem appears to be that the file is first deleted, then

>>>>>>>it

>>>>>>> is

>>>>>>> closed, and a byproduct of the close operation  on a stale fnode is

>>>>>>> that

>>>>>>> another op_file_update() entry gets logged, recreating the deleted

>>>>>>> inode

>>>>>>> with a stale fnode, resulting in the 0 links failure.

>>>>>> Once the file is deleted, all the underlying extents will be

>>>>>>cleared.

>>>>>> Until and unless there is race in handling on

>>>>>> the files from application level, Bluefs doesn't have any control

>>>>>>over

>>>>>> them. If the a file(handle) is been reused after

>>>>>> deletion, that seems to be a bug. Will take a look at the logs to

>>>>>> confirm your theory.

>>>>>> But here, this seems to be more of rename use case. one more thing

>>>>>>to

>>>>>> verify is the extents owned by the inodes mentioned.

>>>>> No, I don't believe this is a rename, see above.

>>>>>

>>>>> I'll note that every occurrence of this problem has been preceded by

>>>>>an

>>>>> OSD segfault.  And this problem is racy, if I increase debug logging

>>>>> for

>>>>> bluefs or bluestore, then the problem goes away.  I have 50+

>>>>>bluestore

>>>>> segfaults and asserts in my cluster logs; I suspect many could be due

>>>>> to

>>>>> this problem, because the same range of blocks is being freed more

>>>>>than

>>>>> once.   If some other file should perform an allocation between the

>>>>> block-freeing in op_file_remove() and the second freeing in

>>>>> op_file_update(), then we could easily have multiple files with the

>>>>> same

>>>>> block allocation at the same time.

>>>>>

>>>>> I think this problem can be caught fairly easily, an assert could be

>>>>> added

>>>>> to _truncate().   The _truncate() routine is only called for files

>>>>> which

>>>>> are open, which means that the inode has to be in the filemap.   Code

>>>>> could be added at the front of _truncate() to look up the inode in

>>>>>the

>>>>> filemap, and if it is not there, then assert.   That would provide a

>>>>> stack

>>>>> trace showing the caller using the file struct with the stale

>>>>>reference

>>>>> count.   (It may be a few days before I have a chance to try this

>>>>> myself.)

>>>> Sure. Could you please let me know the steps to hit issue?

>>>> i am doing only few min runs to check the performance and some unit

>>>>test

>>>> cases.

>>>> I will add the necessary asserts if i can reproduce the issue and fix

>>>>it.

>>> Thanks for the help!   I will have to get back to you on Monday, as

>>>there

>>> is another person on our team who runs the benchmarks that trigger this

>>> problem, I don't know how to run his benchmarks myself.

>>>

>>> Kevan

>>>

>>>> Varada

>>>>> Regards, Kevan

>>>>>

>>>>>

>>>>>


>>

>

Varada Kari July 19, 2016, 4:40 p.m. UTC | #3
You can see the asserts in the destructor of File (~File). If there is still a
valid refcount, I think boost asserts when deleting the object. These are
intrusive pointers.
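
For illustration, a minimal self-contained sketch of the intrusive-pointer
pattern being described; the field names below (nref, num_readers,
num_writers) follow this thread's discussion and are not the actual BlueFS
declarations:

#include <cassert>
#include <atomic>
#include <boost/intrusive_ptr.hpp>

// Illustrative stand-in for BlueFS::File; not the real definition.
struct File {
  std::atomic<int> nref{0};   // intrusive refcount
  int num_readers = 0;        // open-for-read handles (assumed field name)
  int num_writers = 0;        // open-for-write handles (assumed field name)

  ~File() {
    // The destructor-time checks referred to above: by the time the last
    // FileRef goes away, nobody may still hold the file open.
    assert(num_readers == 0);
    assert(num_writers == 0);
  }
};

// boost::intrusive_ptr hooks; the real code goes through RefCountedObject.
inline void intrusive_ptr_add_ref(File *f) { f->nref++; }
inline void intrusive_ptr_release(File *f) { if (--f->nref == 0) delete f; }

using FileRef = boost::intrusive_ptr<File>;

int main() {
  FileRef f(new File);   // nref becomes 1
  return 0;              // last FileRef released here; ~File runs its asserts
}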

Will update if I have a successful recreate of the same scenario.

Varada

On Tuesday 19 July 2016 08:38 PM, Kevan Rehm wrote:
> Varada,

>

>> If we are using FileWriter, we will(should) have a reference on the

>> fnode.

>> I don't think that will be deleted from the file_map. Will wait for your

>> run.

>>

>

>> open_for_read and open_for_write will take a reference on the file and

>> we will increment

>> num_readers and num_writers on the file reference. we have asserts to

>> catch if we have any

>> readers/writers in case file deletion.

>

> I had hoped that, but can't find them in the code, unless they were added

> recently.  See routine _drop_link(), when the link count hits zero, it

> proceeds immediately with the inode/extent deletion.  It does have an

> assert there checking file->num_reading, but (in my copy) no checks for

> num_readers and num_writers.  There are also no checks of the 'nref' in

> RefCountedObject from which File inherits.    If you do find any of these

> elsewhere that provide protection, please let me know, I'd like to

> understand this code better.

>

> Thanks for your help digging into this, and trying to reproduce it.  I'll

> update the bug shortly.

>

> Kevan

>

>

> On 7/19/16, 9:50 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>

>> On Tuesday 19 July 2016 07:40 PM, Kevan Rehm wrote:

>>> Varada,

>>>

>>> I added the following assert in _truncate() but haven't yet had it fire

>>> in

>>> our test cluster.   If you or others also want to give it a try, that

>>> would be great.  It seems like a useful assert.

>>>

>>> diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc

>>> index dbfd59b..759d1af 100644

>>> --- a/src/os/bluestore/BlueFS.cc

>>> +++ b/src/os/bluestore/BlueFS.cc

>>> @@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t

>>> offset)

>>>    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec

>>>             << " file " << h->file->fnode << dendl;

>>>

>>> +  // If ino is not in filemap, we are truncating a file that is not

>>> allocated.

>>> +

>>> +  auto p = file_map.find(h->file->fnode.ino);

>>> +  if (p == file_map.end()) {

>>> +    assert(0 == "_truncate(): truncating a file that is not

>>> allocated");

>>> +  }

>>> +

>>>    // truncate off unflushed data?

>>>    if (h->pos < offset &&

>>>        h->pos + h->buffer.length() > offset) {

>> If we are using FileWriter, we will(should) have a reference on the fnode.

>> I don't think that will be deleted from the file_map. Will wait for your

>> run.

>>> The test case which seemed to make the problem occur more frequently for

>>> us consists of mounting a cephfs filesystem, then starting up 32

>>> instances

>>> of ior on each of 4 client machines.  The ior command looks like this:

>>>

>>> ior -a POSIX -o  /mnt/cephfs/test -b 8m -i 10

>>>

>>>

>>> In the meantime I have been looking further at the code, and I think I

>>> see

>>> the problem-causing sequence of events.

>>>

>>> BlueFS implements unlink() differently than the POSIX model.  In POSIX,

>>> the inode and disk extents for a file are not released until both the

>>> link

>>> count reaches zero and the inode reference count (file open count)

>>> reaches

>>> zero.  BlueFS however releases the inode and disk extents when the link

>>> count (file->refs) reaches zero even if there are threads that still

>>> have

>>> the file open.

>> open_for_read and open_for_write will take a reference on the file and

>> we will increment

>> num_readers and num_writers on the file reference. we have asserts to

>> catch if we have any

>> readers/writers in case file deletion.

>>> Routine _drop_link() does the inode and disk extent releases, and it

>>> sets

>>> a File variable called 'deleted' to true, but note that it does not

>>> otherwise clear the released inode number or disk extent fields in the

>>> File.  It is therefore the responsibility of every thread with a FileRef

>>> for that file to first obtain a lock (to prevent races with

>>> _drop_link())

>>> and then look at the value of 'deleted', and only continue using the

>>> File

>>> if 'deleted' is false.

>>>

>>> Function BlueRocksWritableFile::Close() does not do this.  The routine

>>> does not grab a lock; it calls GetPreallocationStatus(), and if

>>> last_allocated_block is > 0, it calls fs->truncate() (which does get a

>>> lock) which calls _truncate().  Routine _truncate() lowers the file

>>> size,

>>> and calls op_file_update(), which posts an update transaction entry to

>>> the

>>> log for a file that is already unlinked.   This is how we end up with an

>>> inode in the transaction log that has disk extents but its link count is

>>> zero.

>> Yes. i also found the same code path, which truncates the file. Trying

>> to write some

>> unit test cases to run all these ops in a concurrent way to find the

>> problem.

>>> Adding a check of the 'deleted' variable after the lock is taken in

>>> _truncate() would work, but it bothers me a bit to see

>>> BlueRocksWritableFile::Close() using the File struct as if it is still

>>> valid.  Folks with more BlueFS experience can decide if that is a

>>> problem

>>> or not.   They might also be able to describe situations where a file

>>> could be unlinked while it is still open, that might lead to better test

>>> cases which fire the assert more quickly.

>> Added some asserts to find the issue using a test with multiple threads.

>> Still

>> trying reproduce the issue. I believe we should have a reference to the

>> file. Will verify that

>> anyways.

>>> I'm also a bit concerned about the integrity of the BlueFS filesystem,

>>> should the log have to be replayed.  This problem was only seen because

>>> the inode number wasn't reused in the log after it was released.  What

>>> if

>>> a subsequent log transaction reallocates the inode?  The "file with link

>>> count 0" assert would not catch it.  I guess that subsequent

>>> reallocation

>>> should overlay the bad FileRef->fnode with a new fnode, so maybe it's

>>> harmless?

>> Hopefully. Will try to add some debug around reference count of the file

>> and

>> verify the theory.

>>

>>> It would be nice to have my assert() trigger to provide the "smoking

>>> gun".

>>>  However, if I can't duplicate the problem in the next day or two, I

>>> will

>>> probably just patch _truncate() to return immediately if 'deleted' is

>>> true, and move on.

>>>

>>> What do you think, should I post this history to bug 15724, create a

>>> separate bug, anything else?

>> Please post your findings to 15724, if we find a different problem we

>> can file a

>> different ticket.

>>

>> Varada

>>> Thanks, Kevan

>>>

>>> On 7/16/16, 9:35 AM, "ceph-devel-owner@vger.kernel.org on behalf of

>>> Kevan

>>> Rehm" <ceph-devel-owner@vger.kernel.org on behalf of krehm@cray.com>

>>> wrote:

>>>

>>>> On 7/16/16, 7:16 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>>>>

>>>>> On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:

>>>>>> Varada, see below.

>>>>>>

>>>>>> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>>>>>>

>>>>>>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:

>>>>>>>> Greetings,

>>>>>>>>

>>>>>>>> I have made a bit of progress.  Still looking for ideas/advise

>>>>>>>> while

>>>>>>>> I

>>>>>>>> keep digging.

>>>>>>>>

>>>>>>>> First, the ceph version I forgot to mention last time.

>>>>>>>>

>>>>>>>> # ceph -v

>>>>>>>> ceph version 11.0.0-196-g85bb43e

>>>>>>>> (85bb43e111692989d2296a389ce45377d2297d6f)

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> As a refresher, here is the beginning of the problem transaction

>>>>>>>> which

>>>>>>>> will result in an inode with 0 links.   There are actually three

>>>>>>>> inodes

>>>>>>>> in

>>>>>>>> this transaction which will all end up having 0 links, inodes 109,

>>>>>>>> 102,

>>>>>>>> and 113.

>>>>>>>>

>>>>>>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000:

>>>>>>>> txn(seq 77752 len 35269 crc 1796157826)

>>>>>>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_dir_unlink  db/000154.log

>>>>>>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_file_remove 109

>>>>>>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_dir_unlink  db/000153.log

>>>>>>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_file_remove 102

>>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_dir_unlink  db/000151.log

>>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_file_remove 113

>>>>>>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_dir_unlink  db/000150.log

>>>>>>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_file_remove 111

>>>>>>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12

>>>>>>>> 17:05:53.738683 bdev 1 extents

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:93113

>>>>>>>> 54

>>>>>>>> 8

>>>>>>>> 8+

>>>>>>>> 20

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:

>>>>>>>> 96

>>>>>>>> 0

>>>>>>>> 49

>>>>>>>> 56

>>>>>>>> 16+58720256])

>>>>>>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>> 113 =

>>>>>>>> 0x7fb66aafc880 (new)

>>>>>>> When _get_file() is called with a inode number, if the file is not

>>>>>>> found

>>>>>>> in file_map

>>>>>>> a new file is created and returned, hence that (new) in the log

>>>>>>> file.

>>>>>>> These set of events

>>>>>>> can happen in case of rename. You can see the same BlueFS::rename().

>>>>>>> In

>>>>>>> that case, we use the

>>>>>>> same inode number.

>>>>>> I'm not sure I understand, I do not believe this is a rename case.

>>>>>> For

>>>>>> this to be a rename, there would have to be at least one

>>>>>> op_dir_link()

>>>>>> call for each op_dir_unlink() call in the sequence above.   The only

>>>>>> code

>>>>>> sequence I can find where an op_dir_unlink() and an op_file_remove()

>>>>>> appear together without a corresponding op_dir_link() is when a file

>>>>>> is

>>>>>> being deleted.

>>>>>>

>>>>>> The (new) above makes my point exactly.  The op_file_remove() call

>>>>>> above

>>>>>> freed the block allocation for the inode and released the inode from

>>>>>> the

>>>>>> filemap, showing that the file->refs had to be zero at that time

>>>>>> (routine

>>>>>> _drop_link).   Yet the following op_file_update() call is performing

>>>>>> a

>>>>>> truncate on that same inode, which is not in the filemap.   So the

>>>>>> caller

>>>>>> has to be holding a file structure whose refs field is greater than

>>>>>> zero.

>>>>>> Which implies that the caller's file structure has to be stale.  The

>>>>>> fact

>>>>>> that the op_file_update file contents for inode 113 is exactly the

>>>>>> same

>>>>>> as

>>>>>> the previous op_file_update contents prior to the op_file_remove(),

>>>>>> except

>>>>>> for a reduction in file size, proves I believe that the file struct

>>>>>> used

>>>>>> by the op_file_update() caller is stale.

>>>>>>

>>>>>>

>>>>>>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12

>>>>>>>> 17:05:54.615190 bdev 1 extents

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375

>>>>>>>> 16

>>>>>>>> 8

>>>>>>>> +2

>>>>>>>> 09

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:2

>>>>>>>> 27

>>>>>>>> 5

>>>>>>>> 40

>>>>>>>> 99

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+209715

>>>>>>>> 2,

>>>>>>>> 1

>>>>>>>> :2

>>>>>>>> 51

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2

>>>>>>>> 09

>>>>>>>> 7

>>>>>>>> 15

>>>>>>>> 2,

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649

>>>>>>>> 85

>>>>>>>> 6

>>>>>>>> +1

>>>>>>>> 04

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:3

>>>>>>>> 18

>>>>>>>> 7

>>>>>>>> 67

>>>>>>>> 10

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+209715

>>>>>>>> 2,

>>>>>>>> 1

>>>>>>>> :3

>>>>>>>> 42

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2

>>>>>>>> 09

>>>>>>>> 7

>>>>>>>> 15

>>>>>>>> 2,

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778

>>>>>>>> 81

>>>>>>>> 6

>>>>>>>> +1

>>>>>>>> 04

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:4

>>>>>>>> 05

>>>>>>>> 7

>>>>>>>> 98

>>>>>>>> 91

>>>>>>>> 2+3145728,1:507510784+5242880])

>>>>>>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>> 102 =

>>>>>>>> 0x7fb66aafbb90 (new)

>>>>>>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12

>>>>>>>> 22:41:33.028030 bdev 1 extents

>>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])

>>>>>>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>> 109 =

>>>>>>>> 0x7fb66aafbe60 (new)

>>>>>>>> ...

>>>>>>>>

>>>>>>>>

>>>>>>>> The prior transaction log entry for each of the three inodes was

>>>>>>>> another

>>>>>>>> op_file_update() call.  The only difference between the two updates

>>>>>>>> for

>>>>>>>> ino 133 was file size went from 71359362 -> 71354938, 4424 less

>>>>>>>> bytes,

>>>>>>>> all

>>>>>>>> other fields are identical.

>>>>>>> Are the extents same for both the events?

>>>>>> Yes, please compare this line with the one higher above in this

>>>>>> email,

>>>>>> you

>>>>>> can see that only the file size is different.   That is true for all

>>>>>> three

>>>>>> of these inodes.

>>>>>>

>>>>>>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:

>>>>>>>> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12

>>>>>>>> 17:05:53.738683 bdev 1 extents

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:93113

>>>>>>>> 54

>>>>>>>> 8

>>>>>>>> 8+

>>>>>>>> 20

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:

>>>>>>>> 96

>>>>>>>> 0

>>>>>>>> 49

>>>>>>>> 56

>>>>>>>> 16+58720256])

>>>>>>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>> 113 =

>>>>>>>> 0x7fb66aafbb90

>>>>>>>>

>>>>>>>>

>>>>>>>> Same story for inode 109, the file size changed from 71354939 ->

>>>>>>>> 64421913,

>>>>>>>> 6933026 less bytes, all other fields are identical.

>>>>>>>>

>>>>>>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000:

>>>>>>>> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12

>>>>>>>> 22:41:33.028030 bdev 1 extents

>>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])

>>>>>>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>> 109 =

>>>>>>>> 0x7fb66aafc910

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> Same story for inode 102, the file size changed from 71359405 ->

>>>>>>>> 68557043,

>>>>>>>> 2802362 less bytes, all other fields are identical.

>>>>>>>>

>>>>>>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000:

>>>>>>>> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12

>>>>>>>> 17:05:54.615190 bdev 1 extents

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375

>>>>>>>> 16

>>>>>>>> 8

>>>>>>>> +2

>>>>>>>> 09

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:2

>>>>>>>> 27

>>>>>>>> 5

>>>>>>>> 40

>>>>>>>> 99

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+209715

>>>>>>>> 2,

>>>>>>>> 1

>>>>>>>> :2

>>>>>>>> 51

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2

>>>>>>>> 09

>>>>>>>> 7

>>>>>>>> 15

>>>>>>>> 2,

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649

>>>>>>>> 85

>>>>>>>> 6

>>>>>>>> +1

>>>>>>>> 04

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:3

>>>>>>>> 18

>>>>>>>> 7

>>>>>>>> 67

>>>>>>>> 10

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+209715

>>>>>>>> 2,

>>>>>>>> 1

>>>>>>>> :3

>>>>>>>> 42

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2

>>>>>>>> 09

>>>>>>>> 7

>>>>>>>> 15

>>>>>>>> 2,

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778

>>>>>>>> 81

>>>>>>>> 6

>>>>>>>> +1

>>>>>>>> 04

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:4

>>>>>>>> 05

>>>>>>>> 7

>>>>>>>> 98

>>>>>>>> 91

>>>>>>>> 2+3145728,1:507510784+5242880])

>>>>>>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>> 102 =

>>>>>>>> 0x7fb66aafbe60

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> There is only one caller of op_file_update() in which the size of

>>>>>>>> the

>>>>>>>> file

>>>>>>>> is reduced but all other fields stay the same, that is

>>>>>>>> BlueFS::_truncate(), which in turn is only called by

>>>>>>>> BlueFS::truncate(),

>>>>>>>> which is only called by BlueRocksWritableFile::Close().

>>>>>>>>

>>>>>>>> The op_dir_unlink() and op_file_remove() calls for each of the

>>>>>>>> three

>>>>>>>> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm

>>>>>>>> having

>>>>>>>> difficulty finding where the call is coming from.

>>>>>>> Any syscall making dir rename or file remove or directory remove

>>>>>>> will

>>>>>>> result in BlueFS::rename or unlink

>>>>>>> or corresponding calls and results in op_dir_* or op_file_*

>>>>>>> operations.

>>>>>>> rocksdb wal_manager or compaction

>>>>>>> can trigger these calls during that process. Will take a closer look

>>>>>>> at

>>>>>>> the logs, if i can get something from them.

>>>>>>>> So the problem appears to be that the file is first deleted, then

>>>>>>>> it

>>>>>>>> is

>>>>>>>> closed, and a byproduct of the close operation  on a stale fnode is

>>>>>>>> that

>>>>>>>> another op_file_update() entry gets logged, recreating the deleted

>>>>>>>> inode

>>>>>>>> with a stale fnode, resulting in the 0 links failure.

>>>>>>> Once the file is deleted, all the underlying extents will be

>>>>>>> cleared.

>>>>>>> Until and unless there is race in handling on

>>>>>>> the files from application level, Bluefs doesn't have any control

>>>>>>> over

>>>>>>> them. If the a file(handle) is been reused after

>>>>>>> deletion, that seems to be a bug. Will take a look at the logs to

>>>>>>> confirm your theory.

>>>>>>> But here, this seems to be more of rename use case. one more thing

>>>>>>> to

>>>>>>> verify is the extents owned by the inodes mentioned.

>>>>>> No, I don't believe this is a rename, see above.

>>>>>>

>>>>>> I'll note that every occurrence of this problem has been preceded by

>>>>>> an

>>>>>> OSD segfault.  And this problem is racy, if I increase debug logging

>>>>>> for

>>>>>> bluefs or bluestore, then the problem goes away.  I have 50+

>>>>>> bluestore

>>>>>> segfaults and asserts in my cluster logs; I suspect many could be due

>>>>>> to

>>>>>> this problem, because the same range of blocks is being freed more

>>>>>> than

>>>>>> once.   If some other file should perform an allocation between the

>>>>>> block-freeing in op_file_remove() and the second freeing in

>>>>>> op_file_update(), then we could easily have multiple files with the

>>>>>> same

>>>>>> block allocation at the same time.

>>>>>>

>>>>>> I think this problem can be caught fairly easily, an assert could be

>>>>>> added

>>>>>> to _truncate().   The _truncate() routine is only called for files

>>>>>> which

>>>>>> are open, which means that the inode has to be in the filemap.   Code

>>>>>> could be added at the front of _truncate() to look up the inode in

>>>>>> the

>>>>>> filemap, and if it is not there, then assert.   That would provide a

>>>>>> stack

>>>>>> trace showing the caller using the file struct with the stale

>>>>>> reference

>>>>>> count.   (It may be a few days before I have a chance to try this

>>>>>> myself.)

>>>>> Sure. Could you please let me know the steps to hit issue?

>>>>> i am doing only few min runs to check the performance and some unit

>>>>> test

>>>>> cases.

>>>>> I will add the necessary asserts if i can reproduce the issue and fix

>>>>> it.

>>>> Thanks for the help!   I will have to get back to you on Monday, as

>>>> there

>>>> is another person on our team who runs the benchmarks that trigger this

>>>> problem, I don't know how to run his benchmarks myself.

>>>>

>>>> Kevan

>>>>

>>>>> Varada

>>>>>> Regards, Kevan

>>>>>>

>>>>>>

>>>>>>



>


Kevan Rehm July 19, 2016, 5:04 p.m. UTC | #4
On 7/19/16, 11:40 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>You can see the asserts in destructor of File(~File). if we have any

>valid refcount, i think boost

>asserts to delete the object. These are intrusive pointers.

  
Ah, you are correct, I misunderstood your point.  You were saying that
asserts would fire if readers/writers exist at File object destruction
time.   I thought you were saying that asserts would fire if
readers/writers exist when the file's link count goes to zero; those are
the asserts I could not find.
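
To make the distinction concrete, here is a hypothetical, self-contained
sketch of the kind of link-count-zero check described above (per the thread,
the real _drop_link() only asserts on num_reading); the struct and field
names are stand-ins, not the BlueFS code:

#include <cassert>

// Field names mirror the thread (refs, num_reading, num_readers,
// num_writers, deleted); not the actual BlueFS declarations.
struct FileState {
  int refs = 1;          // link count
  int num_reading = 0;   // threads currently inside a read path
  int num_readers = 0;   // open-for-read handles
  int num_writers = 0;   // open-for-write handles
  bool deleted = false;
};

// The additional checks being looked for would sit where the link count
// drops to zero, before the inode and extents are released.
void drop_link_sketch(FileState &f)
{
  if (--f.refs == 0) {
    assert(f.num_reading == 0);   // the existing style of check
    assert(f.num_readers == 0);   // the missing checks under discussion
    assert(f.num_writers == 0);
    f.deleted = true;
    // ... release extents and erase the inode from file_map ...
  }
}

int main() {
  FileState f;
  drop_link_sketch(f);   // refs 1 -> 0, all counters zero, asserts pass
  return 0;
}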
 
>

>Will update if i have a successful recreate of the same scenario.

 
Very good, thanks!

Kevan
 
>

>Varada

>

>On Tuesday 19 July 2016 08:38 PM, Kevan Rehm wrote:

>> Varada,

>>

>>> If we are using FileWriter, we will(should) have a reference on the

>>> fnode.

>>> I don't think that will be deleted from the file_map. Will wait for

>>>your

>>> run.

>>>

>>

>>> open_for_read and open_for_write will take a reference on the file and

>>> we will increment

>>> num_readers and num_writers on the file reference. we have asserts to

>>> catch if we have any

>>> readers/writers in case file deletion.

>>

>> I had hoped that, but can't find them in the code, unless they were

>>added

>> recently.  See routine _drop_link(), when the link count hits zero, it

>> proceeds immediately with the inode/extent deletion.  It does have an

>> assert there checking file->num_reading, but (in my copy) no checks for

>> num_readers and num_writers.  There are also no checks of the 'nref' in

>> RefCountedObject from which File inherits.    If you do find any of

>>these

>> elsewhere that provide protection, please let me know, I'd like to

>> understand this code better.

>>

>> Thanks for your help digging into this, and trying to reproduce it.

>>I'll

>> update the bug shortly.

>>

>> Kevan

>>

>>

>> On 7/19/16, 9:50 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>>

>>> On Tuesday 19 July 2016 07:40 PM, Kevan Rehm wrote:

>>>> Varada,

>>>>

>>>> I added the following assert in _truncate() but haven't yet had it

>>>>fire

>>>> in

>>>> our test cluster.   If you or others also want to give it a try, that

>>>> would be great.  It seems like a useful assert.

>>>>

>>>> diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc

>>>> index dbfd59b..759d1af 100644

>>>> --- a/src/os/bluestore/BlueFS.cc

>>>> +++ b/src/os/bluestore/BlueFS.cc

>>>> @@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t

>>>> offset)

>>>>    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec

>>>>             << " file " << h->file->fnode << dendl;

>>>>

>>>> +  // If ino is not in filemap, we are truncating a file that is not

>>>> allocated.

>>>> +

>>>> +  auto p = file_map.find(h->file->fnode.ino);

>>>> +  if (p == file_map.end()) {

>>>> +    assert(0 == "_truncate(): truncating a file that is not

>>>> allocated");

>>>> +  }

>>>> +

>>>>    // truncate off unflushed data?

>>>>    if (h->pos < offset &&

>>>>        h->pos + h->buffer.length() > offset) {

>>> If we are using FileWriter, we will(should) have a reference on the

>>>fnode.

>>> I don't think that will be deleted from the file_map. Will wait for

>>>your

>>> run.

>>>> The test case which seemed to make the problem occur more frequently

>>>>for

>>>> us consists of mounting a cephfs filesystem, then starting up 32

>>>> instances

>>>> of ior on each of 4 client machines.  The ior command looks like this:

>>>>

>>>> ior -a POSIX -o  /mnt/cephfs/test -b 8m -i 10

>>>>

>>>>

>>>> In the meantime I have been looking further at the code, and I think I

>>>> see

>>>> the problem-causing sequence of events.

>>>>

>>>> BlueFS implements unlink() differently than the POSIX model.  In

>>>>POSIX,

>>>> the inode and disk extents for a file are not released until both the

>>>> link

>>>> count reaches zero and the inode reference count (file open count)

>>>> reaches

>>>> zero.  BlueFS however releases the inode and disk extents when the

>>>>link

>>>> count (file->refs) reaches zero even if there are threads that still

>>>> have

>>>> the file open.

>>> open_for_read and open_for_write will take a reference on the file and

>>> we will increment

>>> num_readers and num_writers on the file reference. we have asserts to

>>> catch if we have any

>>> readers/writers in case file deletion.

>>>> Routine _drop_link() does the inode and disk extent releases, and it

>>>> sets

>>>> a File variable called 'deleted' to true, but note that it does not

>>>> otherwise clear the released inode number or disk extent fields in the

>>>> File.  It is therefore the responsibility of every thread with a

>>>>FileRef

>>>> for that file to first obtain a lock (to prevent races with

>>>> _drop_link())

>>>> and then look at the value of 'deleted', and only continue using the

>>>> File

>>>> if 'deleted' is false.

>>>>

>>>> Function BlueRocksWritableFile::Close() does not do this.  The routine

>>>> does not grab a lock; it calls GetPreallocationStatus(), and if

>>>> last_allocated_block is > 0, it calls fs->truncate() (which does get a

>>>> lock) which calls _truncate().  Routine _truncate() lowers the file

>>>> size,

>>>> and calls op_file_update(), which posts an update transaction entry to

>>>> the

>>>> log for a file that is already unlinked.   This is how we end up with

>>>>an

>>>> inode in the transaction log that has disk extents but its link count

>>>>is

>>>> zero.

>>> Yes. i also found the same code path, which truncates the file. Trying

>>> to write some

>>> unit test cases to run all these ops in a concurrent way to find the

>>> problem.

>>>> Adding a check of the 'deleted' variable after the lock is taken in

>>>> _truncate() would work, but it bothers me a bit to see

>>>> BlueRocksWritableFile::Close() using the File struct as if it is still

>>>> valid.  Folks with more BlueFS experience can decide if that is a

>>>> problem

>>>> or not.   They might also be able to describe situations where a file

>>>> could be unlinked while it is still open, that might lead to better

>>>>test

>>>> cases which fire the assert more quickly.

>>> Added some asserts to find the issue using a test with multiple

>>>threads.

>>> Still

>>> trying reproduce the issue. I believe we should have a reference to the

>>> file. Will verify that

>>> anyways.

>>>> I'm also a bit concerned about the integrity of the BlueFS filesystem,

>>>> should the log have to be replayed.  This problem was only seen

>>>>because

>>>> the inode number wasn't reused in the log after it was released.  What

>>>> if

>>>> a subsequent log transaction reallocates the inode?  The "file with

>>>>link

>>>> count 0" assert would not catch it.  I guess that subsequent

>>>> reallocation

>>>> should overlay the bad FileRef->fnode with a new fnode, so maybe it's

>>>> harmless?

>>> Hopefully. Will try to add some debug around reference count of the

>>>file

>>> and

>>> verify the theory.

>>>

>>>> It would be nice to have my assert() trigger to provide the "smoking

>>>> gun".

>>>>  However, if I can't duplicate the problem in the next day or two, I

>>>> will

>>>> probably just patch _truncate() to return immediately if 'deleted' is

>>>> true, and move on.

>>>>

>>>> What do you think, should I post this history to bug 15724, create a

>>>> separate bug, anything else?

>>> Please post your findings to 15724, if we find a different problem we

>>> can file a

>>> different ticket.

>>>

>>> Varada

>>>> Thanks, Kevan

>>>>

>>>> On 7/16/16, 9:35 AM, "ceph-devel-owner@vger.kernel.org on behalf of

>>>> Kevan

>>>> Rehm" <ceph-devel-owner@vger.kernel.org on behalf of krehm@cray.com>

>>>> wrote:

>>>>

>>>>> On 7/16/16, 7:16 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>>>>>

>>>>>> On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:

>>>>>>> Varada, see below.

>>>>>>>

>>>>>>> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>>>>>>>

>>>>>>>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:

>>>>>>>>> Greetings,

>>>>>>>>>

>>>>>>>>> I have made a bit of progress.  Still looking for ideas/advise

>>>>>>>>> while

>>>>>>>>> I

>>>>>>>>> keep digging.

>>>>>>>>>

>>>>>>>>> First, the ceph version I forgot to mention last time.

>>>>>>>>>

>>>>>>>>> # ceph -v

>>>>>>>>> ceph version 11.0.0-196-g85bb43e

>>>>>>>>> (85bb43e111692989d2296a389ce45377d2297d6f)

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>> As a refresher, here is the beginning of the problem transaction

>>>>>>>>> which

>>>>>>>>> will result in an inode with 0 links.   There are actually three

>>>>>>>>> inodes

>>>>>>>>> in

>>>>>>>>> this transaction which will all end up having 0 links, inodes

>>>>>>>>>109,

>>>>>>>>> 102,

>>>>>>>>> and 113.

>>>>>>>>>

>>>>>>>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> txn(seq 77752 len 35269 crc 1796157826)

>>>>>>>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_dir_unlink  db/000154.log

>>>>>>>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_file_remove 109

>>>>>>>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_dir_unlink  db/000153.log

>>>>>>>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_file_remove 102

>>>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_dir_unlink  db/000151.log

>>>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_file_remove 113

>>>>>>>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_dir_unlink  db/000150.log

>>>>>>>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_file_remove 111

>>>>>>>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12

>>>>>>>>> 17:05:53.738683 bdev 1 extents

>>>>>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])

>>>>>>>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>>> 113 =

>>>>>>>>> 0x7fb66aafc880 (new)

>>>>>>>> When _get_file() is called with a inode number, if the file is not

>>>>>>>> found

>>>>>>>> in file_map

>>>>>>>> a new file is created and returned, hence that (new) in the log

>>>>>>>> file.

>>>>>>>> These set of events

>>>>>>>> can happen in case of rename. You can see the same

>>>>>>>>BlueFS::rename().

>>>>>>>> In

>>>>>>>> that case, we use the

>>>>>>>> same inode number.

>>>>>>> I'm not sure I understand, I do not believe this is a rename case.

>>>>>>> For

>>>>>>> this to be a rename, there would have to be at least one

>>>>>>> op_dir_link()

>>>>>>> call for each op_dir_unlink() call in the sequence above.   The

>>>>>>>only

>>>>>>> code

>>>>>>> sequence I can find where an op_dir_unlink() and an

>>>>>>>op_file_remove()

>>>>>>> appear together without a corresponding op_dir_link() is when a

>>>>>>>file

>>>>>>> is

>>>>>>> being deleted.

>>>>>>>

>>>>>>> The (new) above makes my point exactly.  The op_file_remove() call

>>>>>>> above

>>>>>>> freed the block allocation for the inode and released the inode

>>>>>>>from

>>>>>>> the

>>>>>>> filemap, showing that the file->refs had to be zero at that time

>>>>>>> (routine

>>>>>>> _drop_link).   Yet the following op_file_update() call is

>>>>>>>performing

>>>>>>> a

>>>>>>> truncate on that same inode, which is not in the filemap.   So the

>>>>>>> caller

>>>>>>> has to be holding a file structure whose refs field is greater than

>>>>>>> zero.

>>>>>>> Which implies that the caller's file structure has to be stale.

>>>>>>>The

>>>>>>> fact

>>>>>>> that the op_file_update file contents for inode 113 is exactly the

>>>>>>> same

>>>>>>> as

>>>>>>> the previous op_file_update contents prior to the op_file_remove(),

>>>>>>> except

>>>>>>> for a reduction in file size, proves I believe that the file struct

>>>>>>> used

>>>>>>> by the op_file_update() caller is stale.

>>>>>>>

>>>>>>>

>>>>>>>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12

>>>>>>>>> 17:05:54.615190 bdev 1 extents

>>>>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])

>>>>>>>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>>> 102 =

>>>>>>>>> 0x7fb66aafbb90 (new)

>>>>>>>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12

>>>>>>>>> 22:41:33.028030 bdev 1 extents

>>>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])

>>>>>>>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>>> 109 =

>>>>>>>>> 0x7fb66aafbe60 (new)

>>>>>>>>> ...

>>>>>>>>>

>>>>>>>>>

>>>>>>>>> The prior transaction log entry for each of the three inodes was

>>>>>>>>> another

>>>>>>>>> op_file_update() call.  The only difference between the two

>>>>>>>>>updates

>>>>>>>>> for

>>>>>>>>> ino 133 was file size went from 71359362 -> 71354938, 4424 less

>>>>>>>>> bytes,

>>>>>>>>> all

>>>>>>>>> other fields are identical.

>>>>>>>> Are the extents same for both the events?

>>>>>>> Yes, please compare this line with the one higher above in this

>>>>>>> email,

>>>>>>> you

>>>>>>> can see that only the file size is different.   That is true for

>>>>>>>all

>>>>>>> three

>>>>>>> of these inodes.

>>>>>>>

>>>>>>>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5b4000:

>>>>>>>>> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12

>>>>>>>>> 17:05:53.738683 bdev 1 extents

>>>>>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])

>>>>>>>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>>> 113 =

>>>>>>>>> 0x7fb66aafbb90

>>>>>>>>>

>>>>>>>>>

>>>>>>>>> Same story for inode 109, the file size changed from 71354939 ->

>>>>>>>>> 64421913,

>>>>>>>>> 6933026 less bytes, all other fields are identical.

>>>>>>>>>

>>>>>>>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5ba000:

>>>>>>>>> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12

>>>>>>>>> 22:41:33.028030 bdev 1 extents

>>>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])

>>>>>>>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>>> 109 =

>>>>>>>>> 0x7fb66aafc910

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>> Same story for inode 102, the file size changed from 71359405 ->

>>>>>>>>> 68557043,

>>>>>>>>> 2802362 less bytes, all other fields are identical.

>>>>>>>>>

>>>>>>>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5b9000:

>>>>>>>>> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12

>>>>>>>>> 17:05:54.615190 bdev 1 extents

>>>>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])

>>>>>>>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>>> 102 =

>>>>>>>>> 0x7fb66aafbe60

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>> There is only one caller of op_file_update() in which the size of

>>>>>>>>> the

>>>>>>>>> file

>>>>>>>>> is reduced but all other fields stay the same, that is

>>>>>>>>> BlueFS::_truncate(), which in turn is only called by

>>>>>>>>> BlueFS::truncate(),

>>>>>>>>> which is only called by BlueRocksWritableFile::Close().

>>>>>>>>>

>>>>>>>>> The op_dir_unlink() and op_file_remove() calls for each of the

>>>>>>>>> three

>>>>>>>>> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm

>>>>>>>>> having

>>>>>>>>> difficulty finding where the call is coming from.

>>>>>>>> Any syscall making dir rename or file remove or directory remove

>>>>>>>> will

>>>>>>>> result in BlueFS::rename or unlink

>>>>>>>> or corresponding calls and results in op_dir_* or op_file_*

>>>>>>>> operations.

>>>>>>>> rocksdb wal_manager or compaction

>>>>>>>> can trigger these calls during that process. Will take a closer

>>>>>>>>look

>>>>>>>> at

>>>>>>>> the logs, if i can get something from them.

>>>>>>>>> So the problem appears to be that the file is first deleted, then

>>>>>>>>> it

>>>>>>>>> is

>>>>>>>>> closed, and a byproduct of the close operation  on a stale fnode

>>>>>>>>>is

>>>>>>>>> that

>>>>>>>>> another op_file_update() entry gets logged, recreating the

>>>>>>>>>deleted

>>>>>>>>> inode

>>>>>>>>> with a stale fnode, resulting in the 0 links failure.

>>>>>>>> Once the file is deleted, all the underlying extents will be

>>>>>>>> cleared.

>>>>>>>> Until and unless there is race in handling on

>>>>>>>> the files from application level, Bluefs doesn't have any control

>>>>>>>> over

>>>>>>>> them. If the a file(handle) is been reused after

>>>>>>>> deletion, that seems to be a bug. Will take a look at the logs to

>>>>>>>> confirm your theory.

>>>>>>>> But here, this seems to be more of rename use case. one more thing

>>>>>>>> to

>>>>>>>> verify is the extents owned by the inodes mentioned.

>>>>>>> No, I don't believe this is a rename, see above.

>>>>>>>

>>>>>>> I'll note that every occurrence of this problem has been preceded

>>>>>>>by

>>>>>>> an

>>>>>>> OSD segfault.  And this problem is racy, if I increase debug

>>>>>>>logging

>>>>>>> for

>>>>>>> bluefs or bluestore, then the problem goes away.  I have 50+

>>>>>>> bluestore

>>>>>>> segfaults and asserts in my cluster logs; I suspect many could be

>>>>>>>due

>>>>>>> to

>>>>>>> this problem, because the same range of blocks is being freed more

>>>>>>> than

>>>>>>> once.   If some other file should perform an allocation between the

>>>>>>> block-freeing in op_file_remove() and the second freeing in

>>>>>>> op_file_update(), then we could easily have multiple files with the

>>>>>>> same

>>>>>>> block allocation at the same time.

>>>>>>>

>>>>>>> I think this problem can be caught fairly easily: an assert could be
>>>>>>> added to _truncate().  The _truncate() routine is only called for
>>>>>>> files which are open, which means that the inode has to be in the
>>>>>>> filemap.  Code could be added at the front of _truncate() to look up
>>>>>>> the inode in the filemap, and if it is not there, then assert.  That
>>>>>>> would provide a stack trace showing the caller using the file struct
>>>>>>> with the stale reference count.  (It may be a few days before I have a
>>>>>>> chance to try this myself.)

>>>>>> Sure. Could you please let me know the steps to hit the issue? I am
>>>>>> only doing a few short runs to check performance and some unit test
>>>>>> cases. I will add the necessary asserts if I can reproduce the issue,
>>>>>> and fix it.

>>>>> Thanks for the help!  I will have to get back to you on Monday, as there
>>>>> is another person on our team who runs the benchmarks that trigger this
>>>>> problem; I don't know how to run his benchmarks myself.

>>>>>

>>>>> Kevan

>>>>>

>>>>>> Varada

>>>>>>> Regards, Kevan

>>>>>>>

>>>>>>>

>>>>>>>


diff mbox

Patch

diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc
index dbfd59b..759d1af 100644
--- a/src/os/bluestore/BlueFS.cc
+++ b/src/os/bluestore/BlueFS.cc
@@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t offset)
   dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
            << " file " << h->file->fnode << dendl;
 
+  // If ino is not in filemap, we are truncating a file that is not allocated.
+
+  auto p = file_map.find(h->file->fnode.ino);
+  if (p == file_map.end()) {
+    assert(0 == "_truncate(): truncating a file that is not allocated");
+  }
+
   // truncate off unflushed data?