
bluestore bug#15724: bluefs _replay file with link count 0:

Message ID D3B38AFD.BB191%krehm@cray.com (mailing list archive)
State New, archived

Commit Message

Kevan Rehm July 19, 2016, 2:09 p.m. UTC
Varada,

I added the following assert in _truncate() but haven't yet had it fire in
our test cluster.  If you or others also want to give it a try, that
would be great.  It seems like a useful assert.

diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc
index dbfd59b..759d1af 100644
--- a/src/os/bluestore/BlueFS.cc
+++ b/src/os/bluestore/BlueFS.cc
@@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t offset)
   dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
            << " file " << h->file->fnode << dendl;

+  // If ino is not in filemap, we are truncating a file that is not allocated.
+
+  auto p = file_map.find(h->file->fnode.ino);
+  if (p == file_map.end()) {
+    assert(0 == "_truncate(): truncating a file that is not allocated");
+  }
+
   // truncate off unflushed data?
   if (h->pos < offset &&
       h->pos + h->buffer.length() > offset) {


The test case which seemed to make the problem occur more frequently for
us consists of mounting a cephfs filesystem, then starting up 32 instances
of ior on each of 4 client machines.  The ior command looks like this:

ior -a POSIX -o  /mnt/cephfs/test -b 8m -i 10


In the meantime I have been looking further at the code, and I think I see
the problem-causing sequence of events.

BlueFS implements unlink() differently than the POSIX model.  In POSIX,
the inode and disk extents for a file are not released until both the link
count reaches zero and the inode reference count (file open count) reaches
zero.  BlueFS however releases the inode and disk extents when the link
count (file->refs) reaches zero even if there are threads that still have
the file open.  

Routine _drop_link() does the inode and disk extent releases, and it sets
a File variable called 'deleted' to true, but note that it does not
otherwise clear the released inode number or disk extent fields in the
File.  It is therefore the responsibility of every thread with a FileRef
for that file to first obtain a lock (to prevent races with _drop_link())
and then look at the value of 'deleted', and only continue using the File
if 'deleted' is false.
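
To spell out the pattern I mean, a FileRef holder would need to do roughly
the following before touching the fnode (sketch only, not actual BlueFS
source; I am assuming the usual single BlueFS mutex here):

  std::lock_guard<std::mutex> l(lock);   // serialize against _drop_link()
  if (file->deleted) {
    return 0;   // inode and disk extents are already released
  }
  // only now is it safe to read or log file->fnode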

Function BlueRocksWritableFile::Close() does not do this.  The routine
does not grab a lock; it calls GetPreallocationStatus(), and if
last_allocated_block is > 0, it calls fs->truncate() (which does get a
lock) which calls _truncate().  Routine _truncate() lowers the file size,
and calls op_file_update(), which posts an update transaction entry to the
log for a file that is already unlinked.   This is how we end up with an
inode in the transaction log that has disk extents but its link count is
zero.
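
Paraphrasing from memory rather than quoting the tree, the path looks
roughly like this (sketch; exact arguments may differ):

  rocksdb::Status BlueRocksWritableFile::Close() {
    size_t block_size, last_allocated_block;
    GetPreallocationStatus(&block_size, &last_allocated_block);  // no lock, no 'deleted' check
    if (last_allocated_block > 0) {
      fs->truncate(h, h->pos);   // takes the BlueFS lock, then _truncate()
                                 // logs op_file_update() for the dead inode
    }
    // ... remainder of Close() ...
  }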

Adding a check of the 'deleted' variable after the lock is taken in
_truncate() would work, but it bothers me a bit to see
BlueRocksWritableFile::Close() using the File struct as if it is still
valid.  Folks with more BlueFS experience can decide if that is a problem
or not.   They might also be able to describe situations where a file
could be unlinked while it is still open, that might lead to better test
cases which fire the assert more quickly.

I'm also a bit concerned about the integrity of the BlueFS filesystem,
should the log have to be replayed.  This problem was only seen because
the inode number wasn't reused in the log after it was released.  What if
a subsequent log transaction reallocates the inode?  The "file with link
count 0" assert would not catch it.  I guess that subsequent reallocation
should overlay the bad FileRef->fnode with a new fnode, so maybe it's
harmless?

It would be nice to have my assert() trigger to provide the "smoking gun".
 However, if I can't duplicate the problem in the next day or two, I will
probably just patch _truncate() to return immediately if 'deleted' is
true, and move on. 
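
For concreteness, the workaround would just be an early-out at the top of
_truncate(), something like this (untested sketch):

  int BlueFS::_truncate(FileWriter *h, uint64_t offset)
  {
    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
             << " file " << h->file->fnode << dendl;
    if (h->file->deleted) {
      // already unlinked by _drop_link(); logging an op_file_update() here
      // would recreate a link-count-0 inode in the bluefs transaction log
      return 0;
    }
    // ... rest of _truncate() unchanged ...
  }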

What do you think, should I post this history to bug 15724, create a
separate bug, anything else?

Thanks, Kevan

On 7/16/16, 9:35 AM, "ceph-devel-owner@vger.kernel.org on behalf of Kevan
Rehm" <ceph-devel-owner@vger.kernel.org on behalf of krehm@cray.com> wrote:

>
>
>On 7/16/16, 7:16 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>
>>On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:
>>> Varada, see below.
>>>
>>> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:
>>>
>>>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:
>>>>> Greetings,
>>>>>
>>>>> I have made a bit of progress.  Still looking for ideas/advise while
>>>>>I
>>>>> keep digging.
>>>>>
>>>>> First, the ceph version I forgot to mention last time.
>>>>>
>>>>> # ceph -v
>>>>> ceph version 11.0.0-196-g85bb43e
>>>>> (85bb43e111692989d2296a389ce45377d2297d6f)
>>>>>
>>>>>
>>>>>
>>>>> As a refresher, here is the beginning of the problem transaction
>>>>>which
>>>>> will result in an inode with 0 links.   There are actually three
>>>>>inodes
>>>>> in
>>>>> this transaction which will all end up having 0 links, inodes 109,
>>>>>102,
>>>>> and 113.
>>>>>
>>>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000:
>>>>> txn(seq 77752 len 35269 crc 1796157826)
>>>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_dir_unlink  db/000154.log
>>>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_remove 109
>>>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_dir_unlink  db/000153.log
>>>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_remove 102
>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_dir_unlink  db/000151.log
>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_remove 113
>>>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_dir_unlink  db/000150.log
>>>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_remove 111
>>>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12
>>>>> 17:05:53.738683 bdev 1 extents
>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>>>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino 113 =
>>>>> 0x7fb66aafc880 (new)
>>>> When _get_file() is called with a inode number, if the file is not
>>>>found
>>>> in file_map
>>>> a new file is created and returned, hence that (new) in the log file.
>>>> These set of events
>>>> can happen in case of rename. You can see the same BlueFS::rename().
>>>>In
>>>> that case, we use the
>>>> same inode number.
>>>
>>> I'm not sure I understand, I do not believe this is a rename case.  For
>>> this to be a rename, there would have to be at least one op_dir_link()
>>> call for each op_dir_unlink() call in the sequence above.   The only
>>>code
>>> sequence I can find where an op_dir_unlink() and an op_file_remove()
>>> appear together without a corresponding op_dir_link() is when a file is
>>> being deleted.
>>>
>>> The (new) above makes my point exactly.  The op_file_remove() call
>>>above
>>> freed the block allocation for the inode and released the inode from
>>>the
>>> filemap, showing that the file->refs had to be zero at that time
>>>(routine
>>> _drop_link).   Yet the following op_file_update() call is performing a
>>> truncate on that same inode, which is not in the filemap.   So the
>>>caller
>>> has to be holding a file structure whose refs field is greater than
>>>zero.
>>> Which implies that the caller's file structure has to be stale.  The
>>>fact
>>> that the op_file_update file contents for inode 113 is exactly the same
>>>as
>>> the previous op_file_update contents prior to the op_file_remove(),
>>>except
>>> for a reduction in file size, proves I believe that the file struct
>>>used
>>> by the op_file_update() caller is stale.
>>>
>>>
>>>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12
>>>>> 17:05:54.615190 bdev 1 extents
>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])
>>>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino 102 =
>>>>> 0x7fb66aafbb90 (new)
>>>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000:
>>>>> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12
>>>>> 22:41:33.028030 bdev 1 extents
>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino 109 =
>>>>> 0x7fb66aafbe60 (new)
>>>>> ...
>>>>>
>>>>>
>>>>> The prior transaction log entry for each of the three inodes was
>>>>>another
>>>>> op_file_update() call.  The only difference between the two updates
>>>>>for
>>>>> ino 113 was file size went from 71359362 -> 71354938, 4424 less
>>>>>bytes,
>>>>> all
>>>>> other fields are identical.
>>>> Are the extents same for both the events?
>>>
>>> Yes, please compare this line with the one higher above in this email,
>>>you
>>> can see that only the file size is different.   That is true for all
>>>three
>>> of these inodes.
>>>
>>>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:
>>>>> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12
>>>>> 17:05:53.738683 bdev 1 extents
>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])
>>>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino 113 =
>>>>> 0x7fb66aafbb90
>>>>>
>>>>>
>>>>> Same story for inode 109, the file size changed from 71354939 ->
>>>>> 64421913,
>>>>> 6933026 less bytes, all other fields are identical.
>>>>>
>>>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000:
>>>>> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12
>>>>> 22:41:33.028030 bdev 1 extents
>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])
>>>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino 109 =
>>>>> 0x7fb66aafc910
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Same story for inode 102, the file size changed from 71359405 ->
>>>>> 68557043,
>>>>> 2802362 less bytes, all other fields are identical.
>>>>>
>>>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000:
>>>>> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12
>>>>> 17:05:54.615190 bdev 1 extents
>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])
>>>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino 102 =
>>>>> 0x7fb66aafbe60
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> There is only one caller of op_file_update() in which the size of the
>>>>> file
>>>>> is reduced but all other fields stay the same, that is
>>>>> BlueFS::_truncate(), which in turn is only called by
>>>>>BlueFS::truncate(),
>>>>> which is only called by BlueRocksWritableFile::Close().
>>>>>
>>>>> The op_dir_unlink() and op_file_remove() calls for each of the three
>>>>> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm
>>>>>having
>>>>> difficulty finding where the call is coming from.
>>>> Any syscall making dir rename or file remove or directory remove will
>>>> result in BlueFS::rename or unlink
>>>> or corresponding calls and results in op_dir_* or op_file_*
>>>>operations.
>>>> rocksdb wal_manager or compaction
>>>> can trigger these calls during that process. Will take a closer look
>>>>at
>>>> the logs, if i can get something from them.
>>>>> So the problem appears to be that the file is first deleted, then it
>>>>>is
>>>>> closed, and a byproduct of the close operation  on a stale fnode is
>>>>>that
>>>>> another op_file_update() entry gets logged, recreating the deleted
>>>>>inode
>>>>> with a stale fnode, resulting in the 0 links failure.
>>>> Once the file is deleted, all the underlying extents will be cleared.
>>>> Until and unless there is race in handling on
>>>> the files from application level, Bluefs doesn't have any control over
>>>> them. If the a file(handle) is been reused after
>>>> deletion, that seems to be a bug. Will take a look at the logs to
>>>> confirm your theory.
>>>> But here, this seems to be more of rename use case. one more thing to
>>>> verify is the extents owned by the inodes mentioned.
>>>
>>> No, I don't believe this is a rename, see above.
>>>
>>> I'll note that every occurrence of this problem has been preceded by an
>>> OSD segfault.  And this problem is racy, if I increase debug logging
>>>for
>>> bluefs or bluestore, then the problem goes away.  I have 50+ bluestore
>>> segfaults and asserts in my cluster logs; I suspect many could be due
>>>to
>>> this problem, because the same range of blocks is being freed more than
>>> once.   If some other file should perform an allocation between the
>>> block-freeing in op_file_remove() and the second freeing in
>>> op_file_update(), then we could easily have multiple files with the
>>>same
>>> block allocation at the same time.
>>>
>>> I think this problem can be caught fairly easily, an assert could be
>>>added
>>> to _truncate().   The _truncate() routine is only called for files
>>>which
>>> are open, which means that the inode has to be in the filemap.   Code
>>> could be added at the front of _truncate() to look up the inode in the
>>> filemap, and if it is not there, then assert.   That would provide a
>>>stack
>>> trace showing the caller using the file struct with the stale reference
>>> count.   (It may be a few days before I have a chance to try this
>>>myself.)
>>Sure. Could you please let me know the steps to hit issue?
>>i am doing only few min runs to check the performance and some unit test
>>cases.
>>I will add the necessary asserts if i can reproduce the issue and fix it.
> 
>Thanks for the help!   I will have to get back to you on Monday, as there
>is another person on our team who runs the benchmarks that trigger this
>problem, I don't know how to run his benchmarks myself.
>
>Kevan
> 
>>
>>Varada
>>> Regards, Kevan
>>>
>>>
>>>
>>
>>PLEASE NOTE: The information contained in this electronic mail message is
>>intended only for the use of the designated recipient(s) named above. If
>>the reader of this message is not the intended recipient, you are hereby
>>notified that you have received this message in error and that any
>>review, dissemination, distribution, or copying of this message is
>>strictly prohibited. If you have received this communication in error,
>>please notify the sender by telephone or e-mail (as shown above)
>>immediately and destroy any and all copies of this message in your
>>possession (whether hard copies or electronically stored copies).
>
>--
>To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Varada Kari July 19, 2016, 2:50 p.m. UTC | #1
On Tuesday 19 July 2016 07:40 PM, Kevan Rehm wrote:
> Varada,
>
> I added the following assert in _truncate() but haven't yet had it fire in
> our test cluster.   If you or others also want to give it a try, that
> would be great.  It seems like a useful assert.
>
> diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc
> index dbfd59b..759d1af 100644
> --- a/src/os/bluestore/BlueFS.cc
> +++ b/src/os/bluestore/BlueFS.cc
> @@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t
> offset)
>    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
>             << " file " << h->file->fnode << dendl;
>
> +  // If ino is not in filemap, we are truncating a file that is not
> allocated.
> +
> +  auto p = file_map.find(h->file->fnode.ino);
> +  if (p == file_map.end()) {
> +    assert(0 == "_truncate(): truncating a file that is not allocated");
> +  }
> +
>    // truncate off unflushed data?
>    if (h->pos < offset &&
>        h->pos + h->buffer.length() > offset) {
If we are using FileWriter, we will(should) have a reference on the fnode.
I don't think that will be deleted from the file_map. Will wait for your
run.
>
> The test case which seemed to make the problem occur more frequently for
> us consists of mounting a cephfs filesystem, then starting up 32 instances
> of ior on each of 4 client machines.  The ior command looks like this:
>
> ior -a POSIX -o  /mnt/cephfs/test -b 8m -i 10
>
>
> In the meantime I have been looking further at the code, and I think I see
> the problem-causing sequence of events.
>
> BlueFS implements unlink() differently than the POSIX model.  In POSIX,
> the inode and disk extents for a file are not released until both the link
> count reaches zero and the inode reference count (file open count) reaches
> zero.  BlueFS however releases the inode and disk extents when the link
> count (file->refs) reaches zero even if there are threads that still have
> the file open.
open_for_read and open_for_write will take a reference on the file and
we will increment
num_readers and num_writers on the file reference. we have asserts to
catch if we have any
readers/writers in case file deletion.
> Routine _drop_link() does the inode and disk extent releases, and it sets
> a File variable called 'deleted' to true, but note that it does not
> otherwise clear the released inode number or disk extent fields in the
> File.  It is therefore the responsibility of every thread with a FileRef
> for that file to first obtain a lock (to prevent races with _drop_link())
> and then look at the value of 'deleted', and only continue using the File
> if 'deleted' is false.
>
> Function BlueRocksWritableFile::Close() does not do this.  The routine
> does not grab a lock; it calls GetPreallocationStatus(), and if
> last_allocated_block is > 0, it calls fs->truncate() (which does get a
> lock) which calls _truncate().  Routine _truncate() lowers the file size,
> and calls op_file_update(), which posts an update transaction entry to the
> log for a file that is already unlinked.   This is how we end up with an
> inode in the transaction log that has disk extents but its link count is
> zero.
Yes. i also found the same code path, which truncates the file. Trying
to write some
unit test cases to run all these ops in a concurrent way to find the
problem.
> Adding a check of the 'deleted' variable after the lock is taken in
> _truncate() would work, but it bothers me a bit to see
> BlueRocksWritableFile::Close() using the File struct as if it is still
> valid.  Folks with more BlueFS experience can decide if that is a problem
> or not.   They might also be able to describe situations where a file
> could be unlinked while it is still open, that might lead to better test
> cases which fire the assert more quickly.
Added some asserts to find the issue using a test with multiple threads.
Still
trying reproduce the issue. I believe we should have a reference to the
file. Will verify that
anyways.
>
> I'm also a bit concerned about the integrity of the BlueFS filesystem,
> should the log have to be replayed.  This problem was only seen because
> the inode number wasn't reused in the log after it was released.  What if
> a subsequent log transaction reallocates the inode?  The "file with link
> count 0" assert would not catch it.  I guess that subsequent reallocation
> should overlay the bad FileRef->fnode with a new fnode, so maybe it's
> harmless?
Hopefully. Will try to add some debug around reference count of the file
and
verify the theory.

> It would be nice to have my assert() trigger to provide the "smoking gun".
>  However, if I can't duplicate the problem in the next day or two, I will
> probably just patch _truncate() to return immediately if 'deleted' is
> true, and move on.
>
> What do you think, should I post this history to bug 15724, create a
> separate bug, anything else?
Please post your findings to 15724, if we find a different problem we
can file a
different ticket.

Varada
> Thanks, Kevan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Kevan Rehm July 19, 2016, 3:08 p.m. UTC | #2
Varada,

> If we are using FileWriter, we will(should) have a reference on the
>fnode.
> I don't think that will be deleted from the file_map. Will wait for your
> run.
>
> open_for_read and open_for_write will take a reference on the file and
>we will increment
> num_readers and num_writers on the file reference. we have asserts to
>catch if we have any
> readers/writers in case file deletion.

I had hoped that, but can't find them in the code, unless they were added
recently.  See routine _drop_link(), when the link count hits zero, it
proceeds immediately with the inode/extent deletion.  It does have an
assert there checking file->num_reading, but (in my copy) no checks for
num_readers and num_writers.  There are also no checks of the 'nref' in
RefCountedObject from which File inherits.    If you do find any of these
elsewhere that provide protection, please let me know, I'd like to
understand this code better.
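
For reference, the sort of protection I was looking for in _drop_link() at
the point where file->refs reaches zero would be something like the
following (hypothetical sketch; I am assuming num_readers and num_writers
are atomic counters like num_reading):

  assert(file->num_reading.load() == 0);   // this assert is there today
  assert(file->num_readers.load() == 0);   // these two are what I could not find
  assert(file->num_writers.load() == 0);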

Thanks for your help digging into this, and trying to reproduce it.  I'll
update the bug shortly.

Kevan


On 7/19/16, 9:50 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>On Tuesday 19 July 2016 07:40 PM, Kevan Rehm wrote:
>> Varada,
>>
>> I added the following assert in _truncate() but haven't yet had it fire
>>in
>> our test cluster.   If you or others also want to give it a try, that
>> would be great.  It seems like a useful assert.
>>
>> diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc
>> index dbfd59b..759d1af 100644
>> --- a/src/os/bluestore/BlueFS.cc
>> +++ b/src/os/bluestore/BlueFS.cc
>> @@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t
>> offset)
>>    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
>>             << " file " << h->file->fnode << dendl;
>>
>> +  // If ino is not in filemap, we are truncating a file that is not
>> allocated.
>> +
>> +  auto p = file_map.find(h->file->fnode.ino);
>> +  if (p == file_map.end()) {
>> +    assert(0 == "_truncate(): truncating a file that is not
>>allocated");
>> +  }
>> +
>>    // truncate off unflushed data?
>>    if (h->pos < offset &&
>>        h->pos + h->buffer.length() > offset) {
>If we are using FileWriter, we will(should) have a reference on the fnode.
>I don't think that will be deleted from the file_map. Will wait for your
>run.
>>
>> The test case which seemed to make the problem occur more frequently for
>> us consists of mounting a cephfs filesystem, then starting up 32
>>instances
>> of ior on each of 4 client machines.  The ior command looks like this:
>>
>> ior -a POSIX -o  /mnt/cephfs/test -b 8m -i 10
>>
>>
>> In the meantime I have been looking further at the code, and I think I
>>see
>> the problem-causing sequence of events.
>>
>> BlueFS implements unlink() differently than the POSIX model.  In POSIX,
>> the inode and disk extents for a file are not released until both the
>>link
>> count reaches zero and the inode reference count (file open count)
>>reaches
>> zero.  BlueFS however releases the inode and disk extents when the link
>> count (file->refs) reaches zero even if there are threads that still
>>have
>> the file open.
>open_for_read and open_for_write will take a reference on the file and
>we will increment
>num_readers and num_writers on the file reference. we have asserts to
>catch if we have any
>readers/writers in case file deletion.
>> Routine _drop_link() does the inode and disk extent releases, and it
>>sets
>> a File variable called 'deleted' to true, but note that it does not
>> otherwise clear the released inode number or disk extent fields in the
>> File.  It is therefore the responsibility of every thread with a FileRef
>> for that file to first obtain a lock (to prevent races with
>>_drop_link())
>> and then look at the value of 'deleted', and only continue using the
>>File
>> if 'deleted' is false.
>>
>> Function BlueRocksWritableFile::Close() does not do this.  The routine
>> does not grab a lock; it calls GetPreallocationStatus(), and if
>> last_allocated_block is > 0, it calls fs->truncate() (which does get a
>> lock) which calls _truncate().  Routine _truncate() lowers the file
>>size,
>> and calls op_file_update(), which posts an update transaction entry to
>>the
>> log for a file that is already unlinked.   This is how we end up with an
>> inode in the transaction log that has disk extents but its link count is
>> zero.
>Yes. i also found the same code path, which truncates the file. Trying
>to write some
>unit test cases to run all these ops in a concurrent way to find the
>problem.
>> Adding a check of the 'deleted' variable after the lock is taken in
>> _truncate() would work, but it bothers me a bit to see
>> BlueRocksWritableFile::Close() using the File struct as if it is still
>> valid.  Folks with more BlueFS experience can decide if that is a
>>problem
>> or not.   They might also be able to describe situations where a file
>> could be unlinked while it is still open, that might lead to better test
>> cases which fire the assert more quickly.
>Added some asserts to find the issue using a test with multiple threads.
>Still
>trying reproduce the issue. I believe we should have a reference to the
>file. Will verify that
>anyways.
>>
>> I'm also a bit concerned about the integrity of the BlueFS filesystem,
>> should the log have to be replayed.  This problem was only seen because
>> the inode number wasn't reused in the log after it was released.  What
>>if
>> a subsequent log transaction reallocates the inode?  The "file with link
>> count 0" assert would not catch it.  I guess that subsequent
>>reallocation
>> should overlay the bad FileRef->fnode with a new fnode, so maybe it's
>> harmless?
>Hopefully. Will try to add some debug around reference count of the file
>and
>verify the theory.
>
>> It would be nice to have my assert() trigger to provide the "smoking
>>gun".
>>  However, if I can't duplicate the problem in the next day or two, I
>>will
>> probably just patch _truncate() to return immediately if 'deleted' is
>> true, and move on.
>>
>> What do you think, should I post this history to bug 15724, create a
>> separate bug, anything else?
>Please post your findings to 15724, if we find a different problem we
>can file a
>different ticket.
>
>Varada
>> Thanks, Kevan

>>

>> On 7/16/16, 9:35 AM, "ceph-devel-owner@vger.kernel.org on behalf of

>>Kevan

>> Rehm" <ceph-devel-owner@vger.kernel.org on behalf of krehm@cray.com>

>>wrote:

>>

>>>

>>> On 7/16/16, 7:16 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>>>

>>>> On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:

>>>>> Varada, see below.

>>>>>

>>>>> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>>>>>

>>>>>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:

>>>>>>> Greetings,

>>>>>>>

>>>>>>> I have made a bit of progress.  Still looking for ideas/advise

>>>>>>>while

>>>>>>> I

>>>>>>> keep digging.

>>>>>>>

>>>>>>> First, the ceph version I forgot to mention last time.

>>>>>>>

>>>>>>> # ceph -v

>>>>>>> ceph version 11.0.0-196-g85bb43e

>>>>>>> (85bb43e111692989d2296a389ce45377d2297d6f)

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>> As a refresher, here is the beginning of the problem transaction

>>>>>>> which

>>>>>>> will result in an inode with 0 links.   There are actually three

>>>>>>> inodes

>>>>>>> in

>>>>>>> this transaction which will all end up having 0 links, inodes 109,

>>>>>>> 102,

>>>>>>> and 113.

>>>>>>>

>>>>>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000:

>>>>>>> txn(seq 77752 len 35269 crc 1796157826)

>>>>>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_dir_unlink  db/000154.log

>>>>>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_file_remove 109

>>>>>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_dir_unlink  db/000153.log

>>>>>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_file_remove 102

>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_dir_unlink  db/000151.log

>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_file_remove 113

>>>>>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_dir_unlink  db/000150.log

>>>>>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_file_remove 111

>>>>>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12

>>>>>>> 17:05:53.738683 bdev 1 extents

>>>>>>>

>>>>>>>

>>>>>>> 

>>>>>>>[1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:93113

>>>>>>>54

>>>>>>> 8

>>>>>>> 8+

>>>>>>> 20

>>>>>>>

>>>>>>>

>>>>>>> 

>>>>>>>97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:

>>>>>>>96

>>>>>>> 0

>>>>>>> 49

>>>>>>> 56

>>>>>>> 16+58720256])

>>>>>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>113 =

>>>>>>> 0x7fb66aafc880 (new)

>>>>>> When _get_file() is called with a inode number, if the file is not

>>>>>> found

>>>>>> in file_map

>>>>>> a new file is created and returned, hence that (new) in the log

>>>>>>file.

>>>>>> These set of events

>>>>>> can happen in case of rename. You can see the same BlueFS::rename().

>>>>>> In

>>>>>> that case, we use the

>>>>>> same inode number.

>>>>> I'm not sure I understand, I do not believe this is a rename case.

>>>>>For

>>>>> this to be a rename, there would have to be at least one

>>>>>op_dir_link()

>>>>> call for each op_dir_unlink() call in the sequence above.   The only

>>>>> code

>>>>> sequence I can find where an op_dir_unlink() and an op_file_remove()

>>>>> appear together without a corresponding op_dir_link() is when a file

>>>>>is

>>>>> being deleted.

>>>>>

>>>>> The (new) above makes my point exactly.  The op_file_remove() call

>>>>> above

>>>>> freed the block allocation for the inode and released the inode from

>>>>> the

>>>>> filemap, showing that the file->refs had to be zero at that time

>>>>> (routine

>>>>> _drop_link).   Yet the following op_file_update() call is performing

>>>>>a

>>>>> truncate on that same inode, which is not in the filemap.   So the

>>>>> caller

>>>>> has to be holding a file structure whose refs field is greater than

>>>>> zero.

>>>>> Which implies that the caller's file structure has to be stale.  The

>>>>> fact

>>>>> that the op_file_update file contents for inode 113 is exactly the

>>>>>same

>>>>> as

>>>>> the previous op_file_update contents prior to the op_file_remove(),

>>>>> except

>>>>> for a reduction in file size, proves I believe that the file struct

>>>>> used

>>>>> by the op_file_update() caller is stale.

>>>>>

>>>>>

>>>>>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12

>>>>>>> 17:05:54.615190 bdev 1 extents

>>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])

>>>>>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>102 =

>>>>>>> 0x7fb66aafbb90 (new)

>>>>>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12

>>>>>>> 22:41:33.028030 bdev 1 extents

>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])

>>>>>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>109 =

>>>>>>> 0x7fb66aafbe60 (new)

>>>>>>> ...

>>>>>>>

>>>>>>>

>>>>>>> The prior transaction log entry for each of the three inodes was

>>>>>>> another

>>>>>>> op_file_update() call.  The only difference between the two updates

>>>>>>> for

>>>>>>> ino 133 was file size went from 71359362 -> 71354938, 4424 less

>>>>>>> bytes,

>>>>>>> all

>>>>>>> other fields are identical.

>>>>>> Are the extents same for both the events?

>>>>> Yes, please compare this line with the one higher above in this

>>>>>email,

>>>>> you

>>>>> can see that only the file size is different.   That is true for all

>>>>> three

>>>>> of these inodes.

>>>>>

>>>>>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:

>>>>>>> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12

>>>>>>> 17:05:53.738683 bdev 1 extents

>>>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])

>>>>>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>113 =

>>>>>>> 0x7fb66aafbb90

>>>>>>>

>>>>>>>

>>>>>>> Same story for inode 109, the file size changed from 71354939 ->

>>>>>>> 64421913,

>>>>>>> 6933026 less bytes, all other fields are identical.

>>>>>>>

>>>>>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000:

>>>>>>> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12

>>>>>>> 22:41:33.028030 bdev 1 extents

>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])

>>>>>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>109 =

>>>>>>> 0x7fb66aafc910

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>> Same story for inode 102, the file size changed from 71359405 ->

>>>>>>> 68557043,

>>>>>>> 2802362 less bytes, all other fields are identical.

>>>>>>>

>>>>>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000:

>>>>>>> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12

>>>>>>> 17:05:54.615190 bdev 1 extents

>>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])

>>>>>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>102 =

>>>>>>> 0x7fb66aafbe60

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>>

>>>>>>> There is only one caller of op_file_update() in which the size of

>>>>>>>the

>>>>>>> file

>>>>>>> is reduced but all other fields stay the same, that is

>>>>>>> BlueFS::_truncate(), which in turn is only called by

>>>>>>> BlueFS::truncate(),

>>>>>>> which is only called by BlueRocksWritableFile::Close().

>>>>>>>

>>>>>>> The op_dir_unlink() and op_file_remove() calls for each of the

>>>>>>>three

>>>>>>> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm

>>>>>>> having

>>>>>>> difficulty finding where the call is coming from.

>>>>>> Any syscall making dir rename or file remove or directory remove

>>>>>>will

>>>>>> result in BlueFS::rename or unlink

>>>>>> or corresponding calls and results in op_dir_* or op_file_*

>>>>>> operations.

>>>>>> rocksdb wal_manager or compaction

>>>>>> can trigger these calls during that process. Will take a closer look

>>>>>> at

>>>>>> the logs, if i can get something from them.

>>>>>>> So the problem appears to be that the file is first deleted, then

>>>>>>>it

>>>>>>> is

>>>>>>> closed, and a byproduct of the close operation  on a stale fnode is

>>>>>>> that

>>>>>>> another op_file_update() entry gets logged, recreating the deleted

>>>>>>> inode

>>>>>>> with a stale fnode, resulting in the 0 links failure.

>>>>>> Once the file is deleted, all the underlying extents will be

>>>>>>cleared.

>>>>>> Until and unless there is race in handling on

>>>>>> the files from application level, Bluefs doesn't have any control

>>>>>>over

>>>>>> them. If the a file(handle) is been reused after

>>>>>> deletion, that seems to be a bug. Will take a look at the logs to

>>>>>> confirm your theory.

>>>>>> But here, this seems to be more of rename use case. one more thing

>>>>>>to

>>>>>> verify is the extents owned by the inodes mentioned.

>>>>> No, I don't believe this is a rename, see above.

>>>>>

>>>>> I'll note that every occurrence of this problem has been preceded by

>>>>>an

>>>>> OSD segfault.  And this problem is racy, if I increase debug logging

>>>>> for

>>>>> bluefs or bluestore, then the problem goes away.  I have 50+

>>>>>bluestore

>>>>> segfaults and asserts in my cluster logs; I suspect many could be due

>>>>> to

>>>>> this problem, because the same range of blocks is being freed more

>>>>>than

>>>>> once.   If some other file should perform an allocation between the

>>>>> block-freeing in op_file_remove() and the second freeing in

>>>>> op_file_update(), then we could easily have multiple files with the

>>>>> same

>>>>> block allocation at the same time.

>>>>>

>>>>> I think this problem can be caught fairly easily, an assert could be

>>>>> added

>>>>> to _truncate().   The _truncate() routine is only called for files

>>>>> which

>>>>> are open, which means that the inode has to be in the filemap.   Code

>>>>> could be added at the front of _truncate() to look up the inode in

>>>>>the

>>>>> filemap, and if it is not there, then assert.   That would provide a

>>>>> stack

>>>>> trace showing the caller using the file struct with the stale

>>>>>reference

>>>>> count.   (It may be a few days before I have a chance to try this

>>>>> myself.)

>>>> Sure. Could you please let me know the steps to hit issue?

>>>> i am doing only few min runs to check the performance and some unit

>>>>test

>>>> cases.

>>>> I will add the necessary asserts if i can reproduce the issue and fix

>>>>it.

>>> Thanks for the help!   I will have to get back to you on Monday, as

>>>there

>>> is another person on our team who runs the benchmarks that trigger this

>>> problem, I don't know how to run his benchmarks myself.

>>>

>>> Kevan

>>>

>>>> Varada

>>>>> Regards, Kevan

>>>>>

>>>>>

>>>>>


>>

>

Varada Kari July 19, 2016, 4:40 p.m. UTC | #3
You can see the asserts in the destructor of File (~File). If there is still a
valid refcount, I think boost asserts when deleting the object. These are
intrusive pointers.
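
For illustration, a minimal self-contained sketch of the intrusive-pointer
pattern being described; the field names below (nref, num_readers,
num_writers) follow this thread's discussion and are not the actual BlueFS
declarations:

#include <cassert>
#include <atomic>
#include <boost/intrusive_ptr.hpp>

// Illustrative stand-in for BlueFS::File; not the real definition.
struct File {
  std::atomic<int> nref{0};   // intrusive refcount
  int num_readers = 0;        // open-for-read handles (assumed field name)
  int num_writers = 0;        // open-for-write handles (assumed field name)

  ~File() {
    // The destructor-time checks referred to above: by the time the last
    // FileRef goes away, nobody may still hold the file open.
    assert(num_readers == 0);
    assert(num_writers == 0);
  }
};

// boost::intrusive_ptr hooks; the real code goes through RefCountedObject.
inline void intrusive_ptr_add_ref(File *f) { f->nref++; }
inline void intrusive_ptr_release(File *f) { if (--f->nref == 0) delete f; }

using FileRef = boost::intrusive_ptr<File>;

int main() {
  FileRef f(new File);   // nref becomes 1
  return 0;              // last FileRef released here; ~File runs its asserts
}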

Will update if I have a successful recreate of the same scenario.

Varada

On Tuesday 19 July 2016 08:38 PM, Kevan Rehm wrote:
> Varada,

>

>> If we are using FileWriter, we will(should) have a reference on the

>> fnode.

>> I don't think that will be deleted from the file_map. Will wait for your

>> run.

>>

>

>> open_for_read and open_for_write will take a reference on the file and

>> we will increment

>> num_readers and num_writers on the file reference. we have asserts to

>> catch if we have any

>> readers/writers in case file deletion.

>

> I had hoped that, but can't find them in the code, unless they were added

> recently.  See routine _drop_link(), when the link count hits zero, it

> proceeds immediately with the inode/extent deletion.  It does have an

> assert there checking file->num_reading, but (in my copy) no checks for

> num_readers and num_writers.  There are also no checks of the 'nref' in

> RefCountedObject from which File inherits.    If you do find any of these

> elsewhere that provide protection, please let me know, I'd like to

> understand this code better.

>

> Thanks for your help digging into this, and trying to reproduce it.  I'll

> update the bug shortly.

>

> Kevan

>

>

> On 7/19/16, 9:50 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>

>> On Tuesday 19 July 2016 07:40 PM, Kevan Rehm wrote:

>>> Varada,

>>>

>>> I added the following assert in _truncate() but haven't yet had it fire

>>> in

>>> our test cluster.   If you or others also want to give it a try, that

>>> would be great.  It seems like a useful assert.

>>>

>>> diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc

>>> index dbfd59b..759d1af 100644

>>> --- a/src/os/bluestore/BlueFS.cc

>>> +++ b/src/os/bluestore/BlueFS.cc

>>> @@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t

>>> offset)

>>>    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec

>>>             << " file " << h->file->fnode << dendl;

>>>

>>> +  // If ino is not in filemap, we are truncating a file that is not

>>> allocated.

>>> +

>>> +  auto p = file_map.find(h->file->fnode.ino);

>>> +  if (p == file_map.end()) {

>>> +    assert(0 == "_truncate(): truncating a file that is not

>>> allocated");

>>> +  }

>>> +

>>>    // truncate off unflushed data?

>>>    if (h->pos < offset &&

>>>        h->pos + h->buffer.length() > offset) {

>> If we are using FileWriter, we will(should) have a reference on the fnode.

>> I don't think that will be deleted from the file_map. Will wait for your

>> run.

>>> The test case which seemed to make the problem occur more frequently for

>>> us consists of mounting a cephfs filesystem, then starting up 32

>>> instances

>>> of ior on each of 4 client machines.  The ior command looks like this:

>>>

>>> ior -a POSIX -o  /mnt/cephfs/test -b 8m -i 10

>>>

>>>

>>> In the meantime I have been looking further at the code, and I think I

>>> see

>>> the problem-causing sequence of events.

>>>

>>> BlueFS implements unlink() differently than the POSIX model.  In POSIX,

>>> the inode and disk extents for a file are not released until both the

>>> link

>>> count reaches zero and the inode reference count (file open count)

>>> reaches

>>> zero.  BlueFS however releases the inode and disk extents when the link

>>> count (file->refs) reaches zero even if there are threads that still

>>> have

>>> the file open.

>> open_for_read and open_for_write will take a reference on the file and

>> we will increment

>> num_readers and num_writers on the file reference. we have asserts to

>> catch if we have any

>> readers/writers in case file deletion.

>>> Routine _drop_link() does the inode and disk extent releases, and it

>>> sets

>>> a File variable called 'deleted' to true, but note that it does not

>>> otherwise clear the released inode number or disk extent fields in the

>>> File.  It is therefore the responsibility of every thread with a FileRef

>>> for that file to first obtain a lock (to prevent races with

>>> _drop_link())

>>> and then look at the value of 'deleted', and only continue using the

>>> File

>>> if 'deleted' is false.

>>>

>>> Function BlueRocksWritableFile::Close() does not do this.  The routine

>>> does not grab a lock; it calls GetPreallocationStatus(), and if

>>> last_allocated_block is > 0, it calls fs->truncate() (which does get a

>>> lock) which calls _truncate().  Routine _truncate() lowers the file

>>> size,

>>> and calls op_file_update(), which posts an update transaction entry to

>>> the

>>> log for a file that is already unlinked.   This is how we end up with an

>>> inode in the transaction log that has disk extents but its link count is

>>> zero.

>> Yes. i also found the same code path, which truncates the file. Trying

>> to write some

>> unit test cases to run all these ops in a concurrent way to find the

>> problem.

>>> Adding a check of the 'deleted' variable after the lock is taken in

>>> _truncate() would work, but it bothers me a bit to see

>>> BlueRocksWritableFile::Close() using the File struct as if it is still

>>> valid.  Folks with more BlueFS experience can decide if that is a

>>> problem

>>> or not.   They might also be able to describe situations where a file

>>> could be unlinked while it is still open, that might lead to better test

>>> cases which fire the assert more quickly.

>> Added some asserts to find the issue using a test with multiple threads.

>> Still

>> trying reproduce the issue. I believe we should have a reference to the

>> file. Will verify that

>> anyways.

>>> I'm also a bit concerned about the integrity of the BlueFS filesystem,

>>> should the log have to be replayed.  This problem was only seen because

>>> the inode number wasn't reused in the log after it was released.  What

>>> if

>>> a subsequent log transaction reallocates the inode?  The "file with link

>>> count 0" assert would not catch it.  I guess that subsequent

>>> reallocation

>>> should overlay the bad FileRef->fnode with a new fnode, so maybe it's

>>> harmless?

>> Hopefully. Will try to add some debug around reference count of the file

>> and

>> verify the theory.

>>

>>> It would be nice to have my assert() trigger to provide the "smoking

>>> gun".

>>>  However, if I can't duplicate the problem in the next day or two, I

>>> will

>>> probably just patch _truncate() to return immediately if 'deleted' is

>>> true, and move on.

>>>

>>> What do you think, should I post this history to bug 15724, create a

>>> separate bug, anything else?

>> Please post your findings to 15724, if we find a different problem we

>> can file a

>> different ticket.

>>

>> Varada

>>> Thanks, Kevan

>>>

>>> On 7/16/16, 9:35 AM, "ceph-devel-owner@vger.kernel.org on behalf of

>>> Kevan

>>> Rehm" <ceph-devel-owner@vger.kernel.org on behalf of krehm@cray.com>

>>> wrote:

>>>

>>>> On 7/16/16, 7:16 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>>>>

>>>>> On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:

>>>>>> Varada, see below.

>>>>>>

>>>>>> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>>>>>>

>>>>>>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:

>>>>>>>> Greetings,

>>>>>>>>

>>>>>>>> I have made a bit of progress.  Still looking for ideas/advise

>>>>>>>> while

>>>>>>>> I

>>>>>>>> keep digging.

>>>>>>>>

>>>>>>>> First, the ceph version I forgot to mention last time.

>>>>>>>>

>>>>>>>> # ceph -v

>>>>>>>> ceph version 11.0.0-196-g85bb43e

>>>>>>>> (85bb43e111692989d2296a389ce45377d2297d6f)

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> As a refresher, here is the beginning of the problem transaction

>>>>>>>> which

>>>>>>>> will result in an inode with 0 links.   There are actually three

>>>>>>>> inodes

>>>>>>>> in

>>>>>>>> this transaction which will all end up having 0 links, inodes 109,

>>>>>>>> 102,

>>>>>>>> and 113.

>>>>>>>>

>>>>>>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay 0x5be000:

>>>>>>>> txn(seq 77752 len 35269 crc 1796157826)

>>>>>>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_dir_unlink  db/000154.log

>>>>>>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_file_remove 109

>>>>>>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_dir_unlink  db/000153.log

>>>>>>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_file_remove 102

>>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_dir_unlink  db/000151.log

>>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_file_remove 113

>>>>>>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_dir_unlink  db/000150.log

>>>>>>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_file_remove 111

>>>>>>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12

>>>>>>>> 17:05:53.738683 bdev 1 extents

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:93113

>>>>>>>> 54

>>>>>>>> 8

>>>>>>>> 8+

>>>>>>>> 20

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:

>>>>>>>> 96

>>>>>>>> 0

>>>>>>>> 49

>>>>>>>> 56

>>>>>>>> 16+58720256])

>>>>>>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>> 113 =

>>>>>>>> 0x7fb66aafc880 (new)

>>>>>>> When _get_file() is called with a inode number, if the file is not

>>>>>>> found

>>>>>>> in file_map

>>>>>>> a new file is created and returned, hence that (new) in the log

>>>>>>> file.

>>>>>>> These set of events

>>>>>>> can happen in case of rename. You can see the same BlueFS::rename().

>>>>>>> In

>>>>>>> that case, we use the

>>>>>>> same inode number.

>>>>>> I'm not sure I understand, I do not believe this is a rename case.

>>>>>> For

>>>>>> this to be a rename, there would have to be at least one

>>>>>> op_dir_link()

>>>>>> call for each op_dir_unlink() call in the sequence above.   The only

>>>>>> code

>>>>>> sequence I can find where an op_dir_unlink() and an op_file_remove()

>>>>>> appear together without a corresponding op_dir_link() is when a file

>>>>>> is

>>>>>> being deleted.

>>>>>>

>>>>>> The (new) above makes my point exactly.  The op_file_remove() call

>>>>>> above

>>>>>> freed the block allocation for the inode and released the inode from

>>>>>> the

>>>>>> filemap, showing that the file->refs had to be zero at that time

>>>>>> (routine

>>>>>> _drop_link).   Yet the following op_file_update() call is performing

>>>>>> a

>>>>>> truncate on that same inode, which is not in the filemap.   So the

>>>>>> caller

>>>>>> has to be holding a file structure whose refs field is greater than

>>>>>> zero.

>>>>>> Which implies that the caller's file structure has to be stale.  The

>>>>>> fact

>>>>>> that the op_file_update file contents for inode 113 is exactly the

>>>>>> same

>>>>>> as

>>>>>> the previous op_file_update contents prior to the op_file_remove(),

>>>>>> except

>>>>>> for a reduction in file size, proves I believe that the file struct

>>>>>> used

>>>>>> by the op_file_update() caller is stale.

>>>>>>

>>>>>>

>>>>>>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12

>>>>>>>> 17:05:54.615190 bdev 1 extents

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375

>>>>>>>> 16

>>>>>>>> 8

>>>>>>>> +2

>>>>>>>> 09

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:2

>>>>>>>> 27

>>>>>>>> 5

>>>>>>>> 40

>>>>>>>> 99

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+209715

>>>>>>>> 2,

>>>>>>>> 1

>>>>>>>> :2

>>>>>>>> 51

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2

>>>>>>>> 09

>>>>>>>> 7

>>>>>>>> 15

>>>>>>>> 2,

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649

>>>>>>>> 85

>>>>>>>> 6

>>>>>>>> +1

>>>>>>>> 04

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:3

>>>>>>>> 18

>>>>>>>> 7

>>>>>>>> 67

>>>>>>>> 10

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+209715

>>>>>>>> 2,

>>>>>>>> 1

>>>>>>>> :3

>>>>>>>> 42

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2

>>>>>>>> 09

>>>>>>>> 7

>>>>>>>> 15

>>>>>>>> 2,

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778

>>>>>>>> 81

>>>>>>>> 6

>>>>>>>> +1

>>>>>>>> 04

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:4

>>>>>>>> 05

>>>>>>>> 7

>>>>>>>> 98

>>>>>>>> 91

>>>>>>>> 2+3145728,1:507510784+5242880])

>>>>>>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>> 102 =

>>>>>>>> 0x7fb66aafbb90 (new)

>>>>>>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay 0x5be000:

>>>>>>>> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12

>>>>>>>> 22:41:33.028030 bdev 1 extents

>>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])

>>>>>>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>> 109 =

>>>>>>>> 0x7fb66aafbe60 (new)

>>>>>>>> ...

>>>>>>>>

>>>>>>>>

>>>>>>>> The prior transaction log entry for each of the three inodes was

>>>>>>>> another

>>>>>>>> op_file_update() call.  The only difference between the two updates

>>>>>>>> for

>>>>>>>> ino 133 was file size went from 71359362 -> 71354938, 4424 less

>>>>>>>> bytes,

>>>>>>>> all

>>>>>>>> other fields are identical.

>>>>>>> Are the extents same for both the events?

>>>>>> Yes, please compare this line with the one higher above in this

>>>>>> email,

>>>>>> you

>>>>>> can see that only the file size is different.   That is true for all

>>>>>> three

>>>>>> of these inodes.

>>>>>>

>>>>>>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay 0x5b4000:

>>>>>>>> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12

>>>>>>>> 17:05:53.738683 bdev 1 extents

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:93113

>>>>>>>> 54

>>>>>>>> 8

>>>>>>>> 8+

>>>>>>>> 20

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 97152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:

>>>>>>>> 96

>>>>>>>> 0

>>>>>>>> 49

>>>>>>>> 56

>>>>>>>> 16+58720256])

>>>>>>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>> 113 =

>>>>>>>> 0x7fb66aafbb90

>>>>>>>>

>>>>>>>>

>>>>>>>> Same story for inode 109, the file size changed from 71354939 ->

>>>>>>>> 64421913,

>>>>>>>> 6933026 less bytes, all other fields are identical.

>>>>>>>>

>>>>>>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay 0x5ba000:

>>>>>>>> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12

>>>>>>>> 22:41:33.028030 bdev 1 extents

>>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])

>>>>>>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>> 109 =

>>>>>>>> 0x7fb66aafc910

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> Same story for inode 102, the file size changed from 71359405 ->

>>>>>>>> 68557043,

>>>>>>>> 2802362 less bytes, all other fields are identical.

>>>>>>>>

>>>>>>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay 0x5b9000:

>>>>>>>> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12

>>>>>>>> 17:05:54.615190 bdev 1 extents

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375

>>>>>>>> 16

>>>>>>>> 8

>>>>>>>> +2

>>>>>>>> 09

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 7152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:2

>>>>>>>> 27

>>>>>>>> 5

>>>>>>>> 40

>>>>>>>> 99

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 2+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+209715

>>>>>>>> 2,

>>>>>>>> 1

>>>>>>>> :2

>>>>>>>> 51

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2

>>>>>>>> 09

>>>>>>>> 7

>>>>>>>> 15

>>>>>>>> 2,

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649

>>>>>>>> 85

>>>>>>>> 6

>>>>>>>> +1

>>>>>>>> 04

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 8576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:3

>>>>>>>> 18

>>>>>>>> 7

>>>>>>>> 67

>>>>>>>> 10

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 4+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+209715

>>>>>>>> 2,

>>>>>>>> 1

>>>>>>>> :3

>>>>>>>> 42

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2

>>>>>>>> 09

>>>>>>>> 7

>>>>>>>> 15

>>>>>>>> 2,

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778

>>>>>>>> 81

>>>>>>>> 6

>>>>>>>> +1

>>>>>>>> 04

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> 8576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:4

>>>>>>>> 05

>>>>>>>> 7

>>>>>>>> 98

>>>>>>>> 91

>>>>>>>> 2+3145728,1:507510784+5242880])

>>>>>>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>> 102 =

>>>>>>>> 0x7fb66aafbe60

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>>

>>>>>>>> There is only one caller of op_file_update() in which the size of

>>>>>>>> the

>>>>>>>> file

>>>>>>>> is reduced but all other fields stay the same, that is

>>>>>>>> BlueFS::_truncate(), which in turn is only called by

>>>>>>>> BlueFS::truncate(),

>>>>>>>> which is only called by BlueRocksWritableFile::Close().

>>>>>>>>

>>>>>>>> The op_dir_unlink() and op_file_remove() calls for each of the

>>>>>>>> three

>>>>>>>> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm

>>>>>>>> having

>>>>>>>> difficulty finding where the call is coming from.

>>>>>>> Any syscall making dir rename or file remove or directory remove

>>>>>>> will

>>>>>>> result in BlueFS::rename or unlink

>>>>>>> or corresponding calls and results in op_dir_* or op_file_*

>>>>>>> operations.

>>>>>>> rocksdb wal_manager or compaction

>>>>>>> can trigger these calls during that process. Will take a closer look

>>>>>>> at

>>>>>>> the logs, if i can get something from them.

>>>>>>>> So the problem appears to be that the file is first deleted, then

>>>>>>>> it

>>>>>>>> is

>>>>>>>> closed, and a byproduct of the close operation  on a stale fnode is

>>>>>>>> that

>>>>>>>> another op_file_update() entry gets logged, recreating the deleted

>>>>>>>> inode

>>>>>>>> with a stale fnode, resulting in the 0 links failure.

>>>>>>> Once the file is deleted, all the underlying extents will be

>>>>>>> cleared.

>>>>>>> Until and unless there is race in handling on

>>>>>>> the files from application level, Bluefs doesn't have any control

>>>>>>> over

>>>>>>> them. If the a file(handle) is been reused after

>>>>>>> deletion, that seems to be a bug. Will take a look at the logs to

>>>>>>> confirm your theory.

>>>>>>> But here, this seems to be more of rename use case. one more thing

>>>>>>> to

>>>>>>> verify is the extents owned by the inodes mentioned.

>>>>>> No, I don't believe this is a rename, see above.

>>>>>>

>>>>>> I'll note that every occurrence of this problem has been preceded by

>>>>>> an

>>>>>> OSD segfault.  And this problem is racy, if I increase debug logging

>>>>>> for

>>>>>> bluefs or bluestore, then the problem goes away.  I have 50+

>>>>>> bluestore

>>>>>> segfaults and asserts in my cluster logs; I suspect many could be due

>>>>>> to

>>>>>> this problem, because the same range of blocks is being freed more

>>>>>> than

>>>>>> once.   If some other file should perform an allocation between the

>>>>>> block-freeing in op_file_remove() and the second freeing in

>>>>>> op_file_update(), then we could easily have multiple files with the

>>>>>> same

>>>>>> block allocation at the same time.

>>>>>>

>>>>>> I think this problem can be caught fairly easily, an assert could be

>>>>>> added

>>>>>> to _truncate().   The _truncate() routine is only called for files

>>>>>> which

>>>>>> are open, which means that the inode has to be in the filemap.   Code

>>>>>> could be added at the front of _truncate() to look up the inode in

>>>>>> the

>>>>>> filemap, and if it is not there, then assert.   That would provide a

>>>>>> stack

>>>>>> trace showing the caller using the file struct with the stale

>>>>>> reference

>>>>>> count.   (It may be a few days before I have a chance to try this

>>>>>> myself.)

>>>>> Sure. Could you please let me know the steps to hit issue?

>>>>> i am doing only few min runs to check the performance and some unit

>>>>> test

>>>>> cases.

>>>>> I will add the necessary asserts if i can reproduce the issue and fix

>>>>> it.

>>>> Thanks for the help!   I will have to get back to you on Monday, as

>>>> there

>>>> is another person on our team who runs the benchmarks that trigger this

>>>> problem, I don't know how to run his benchmarks myself.

>>>>

>>>> Kevan

>>>>

>>>>> Varada

>>>>>> Regards, Kevan

>>>>>>

>>>>>>

>>>>>>



>


Kevan Rehm July 19, 2016, 5:04 p.m. UTC | #4
On 7/19/16, 11:40 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>You can see the asserts in destructor of File(~File). if we have any

>valid refcount, i think boost

>asserts to delete the object. These are intrusive pointers.

  
Ah, you are correct, I misunderstood your point.  You were saying that
asserts would fire if readers/writers exist at File object destruction
time.   I thought you were saying that asserts would fire if
readers/writers exist when the file's link count goes to zero; those are
the asserts I could not find.
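
To make the distinction concrete, here is a hypothetical, self-contained
sketch of the kind of link-count-zero check described above (per the thread,
the real _drop_link() only asserts on num_reading); the struct and field
names are stand-ins, not the BlueFS code:

#include <cassert>

// Field names mirror the thread (refs, num_reading, num_readers,
// num_writers, deleted); not the actual BlueFS declarations.
struct FileState {
  int refs = 1;          // link count
  int num_reading = 0;   // threads currently inside a read path
  int num_readers = 0;   // open-for-read handles
  int num_writers = 0;   // open-for-write handles
  bool deleted = false;
};

// The additional checks being looked for would sit where the link count
// drops to zero, before the inode and extents are released.
void drop_link_sketch(FileState &f)
{
  if (--f.refs == 0) {
    assert(f.num_reading == 0);   // the existing style of check
    assert(f.num_readers == 0);   // the missing checks under discussion
    assert(f.num_writers == 0);
    f.deleted = true;
    // ... release extents and erase the inode from file_map ...
  }
}

int main() {
  FileState f;
  drop_link_sketch(f);   // refs 1 -> 0, all counters zero, asserts pass
  return 0;
}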
 
>

>Will update if i have a successful recreate of the same scenario.

 
Very good, thanks!

Kevan
 
>

>Varada

>

>On Tuesday 19 July 2016 08:38 PM, Kevan Rehm wrote:

>> Varada,

>>

>>> If we are using FileWriter, we will(should) have a reference on the

>>> fnode.

>>> I don't think that will be deleted from the file_map. Will wait for

>>>your

>>> run.

>>>

>>

>>> open_for_read and open_for_write will take a reference on the file and

>>> we will increment

>>> num_readers and num_writers on the file reference. we have asserts to

>>> catch if we have any

>>> readers/writers in case file deletion.

>>

>> I had hoped that, but can't find them in the code, unless they were

>>added

>> recently.  See routine _drop_link(), when the link count hits zero, it

>> proceeds immediately with the inode/extent deletion.  It does have an

>> assert there checking file->num_reading, but (in my copy) no checks for

>> num_readers and num_writers.  There are also no checks of the 'nref' in

>> RefCountedObject from which File inherits.    If you do find any of

>>these

>> elsewhere that provide protection, please let me know, I'd like to

>> understand this code better.

>>

>> Thanks for your help digging into this, and trying to reproduce it.

>>I'll

>> update the bug shortly.

>>

>> Kevan

>>

>>

>> On 7/19/16, 9:50 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>>

>>> On Tuesday 19 July 2016 07:40 PM, Kevan Rehm wrote:

>>>> Varada,

>>>>

>>>> I added the following assert in _truncate() but haven't yet had it

>>>>fire

>>>> in

>>>> our test cluster.   If you or others also want to give it a try, that

>>>> would be great.  It seems like a useful assert.

>>>>

>>>> diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc

>>>> index dbfd59b..759d1af 100644

>>>> --- a/src/os/bluestore/BlueFS.cc

>>>> +++ b/src/os/bluestore/BlueFS.cc

>>>> @@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t

>>>> offset)

>>>>    dout(10) << __func__ << " 0x" << std::hex << offset << std::dec

>>>>             << " file " << h->file->fnode << dendl;

>>>>

>>>> +  // If ino is not in filemap, we are truncating a file that is not

>>>> allocated.

>>>> +

>>>> +  auto p = file_map.find(h->file->fnode.ino);

>>>> +  if (p == file_map.end()) {

>>>> +    assert(0 == "_truncate(): truncating a file that is not

>>>> allocated");

>>>> +  }

>>>> +

>>>>    // truncate off unflushed data?

>>>>    if (h->pos < offset &&

>>>>        h->pos + h->buffer.length() > offset) {

>>> If we are using FileWriter, we will(should) have a reference on the

>>>fnode.

>>> I don't think that will be deleted from the file_map. Will wait for

>>>your

>>> run.

>>>> The test case which seemed to make the problem occur more frequently

>>>>for

>>>> us consists of mounting a cephfs filesystem, then starting up 32

>>>> instances

>>>> of ior on each of 4 client machines.  The ior command looks like this:

>>>>

>>>> ior -a POSIX -o  /mnt/cephfs/test -b 8m -i 10

>>>>

>>>>

>>>> In the meantime I have been looking further at the code, and I think I

>>>> see

>>>> the problem-causing sequence of events.

>>>>

>>>> BlueFS implements unlink() differently than the POSIX model.  In

>>>>POSIX,

>>>> the inode and disk extents for a file are not released until both the

>>>> link

>>>> count reaches zero and the inode reference count (file open count)

>>>> reaches

>>>> zero.  BlueFS however releases the inode and disk extents when the

>>>>link

>>>> count (file->refs) reaches zero even if there are threads that still

>>>> have

>>>> the file open.

>>> open_for_read and open_for_write will take a reference on the file and

>>> we will increment

>>> num_readers and num_writers on the file reference. we have asserts to

>>> catch if we have any

>>> readers/writers in case file deletion.

>>>> Routine _drop_link() does the inode and disk extent releases, and it

>>>> sets

>>>> a File variable called 'deleted' to true, but note that it does not

>>>> otherwise clear the released inode number or disk extent fields in the

>>>> File.  It is therefore the responsibility of every thread with a

>>>>FileRef

>>>> for that file to first obtain a lock (to prevent races with

>>>> _drop_link())

>>>> and then look at the value of 'deleted', and only continue using the

>>>> File

>>>> if 'deleted' is false.

>>>>

>>>> Function BlueRocksWritableFile::Close() does not do this.  The routine

>>>> does not grab a lock; it calls GetPreallocationStatus(), and if

>>>> last_allocated_block is > 0, it calls fs->truncate() (which does get a

>>>> lock) which calls _truncate().  Routine _truncate() lowers the file

>>>> size,

>>>> and calls op_file_update(), which posts an update transaction entry to

>>>> the

>>>> log for a file that is already unlinked.   This is how we end up with

>>>>an

>>>> inode in the transaction log that has disk extents but its link count

>>>>is

>>>> zero.

>>> Yes. i also found the same code path, which truncates the file. Trying

>>> to write some

>>> unit test cases to run all these ops in a concurrent way to find the

>>> problem.

>>>> Adding a check of the 'deleted' variable after the lock is taken in

>>>> _truncate() would work, but it bothers me a bit to see

>>>> BlueRocksWritableFile::Close() using the File struct as if it is still

>>>> valid.  Folks with more BlueFS experience can decide if that is a

>>>> problem

>>>> or not.   They might also be able to describe situations where a file

>>>> could be unlinked while it is still open, that might lead to better

>>>>test

>>>> cases which fire the assert more quickly.

>>> Added some asserts to find the issue using a test with multiple

>>>threads.

>>> Still

>>> trying reproduce the issue. I believe we should have a reference to the

>>> file. Will verify that

>>> anyways.

>>>> I'm also a bit concerned about the integrity of the BlueFS filesystem,

>>>> should the log have to be replayed.  This problem was only seen

>>>>because

>>>> the inode number wasn't reused in the log after it was released.  What

>>>> if

>>>> a subsequent log transaction reallocates the inode?  The "file with

>>>>link

>>>> count 0" assert would not catch it.  I guess that subsequent

>>>> reallocation

>>>> should overlay the bad FileRef->fnode with a new fnode, so maybe it's

>>>> harmless?

>>> Hopefully. Will try to add some debug around reference count of the

>>>file

>>> and

>>> verify the theory.

>>>

>>>> It would be nice to have my assert() trigger to provide the "smoking

>>>> gun".

>>>>  However, if I can't duplicate the problem in the next day or two, I

>>>> will

>>>> probably just patch _truncate() to return immediately if 'deleted' is

>>>> true, and move on.

>>>>

>>>> What do you think, should I post this history to bug 15724, create a

>>>> separate bug, anything else?

>>> Please post your findings to 15724, if we find a different problem we

>>> can file a

>>> different ticket.

>>>

>>> Varada

>>>> Thanks, Kevan

>>>>

>>>> On 7/16/16, 9:35 AM, "ceph-devel-owner@vger.kernel.org on behalf of

>>>> Kevan

>>>> Rehm" <ceph-devel-owner@vger.kernel.org on behalf of krehm@cray.com>

>>>> wrote:

>>>>

>>>>> On 7/16/16, 7:16 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>>>>>

>>>>>> On Saturday 16 July 2016 04:53 PM, Kevan Rehm wrote:

>>>>>>> Varada, see below.

>>>>>>>

>>>>>>> On 7/16/16, 2:23 AM, "Varada Kari" <Varada.Kari@sandisk.com> wrote:

>>>>>>>

>>>>>>>> On Saturday 16 July 2016 01:52 AM, Kevan Rehm wrote:

>>>>>>>>> Greetings,

>>>>>>>>>

>>>>>>>>> I have made a bit of progress.  Still looking for ideas/advise

>>>>>>>>> while

>>>>>>>>> I

>>>>>>>>> keep digging.

>>>>>>>>>

>>>>>>>>> First, the ceph version I forgot to mention last time.

>>>>>>>>>

>>>>>>>>> # ceph -v

>>>>>>>>> ceph version 11.0.0-196-g85bb43e

>>>>>>>>> (85bb43e111692989d2296a389ce45377d2297d6f)

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>> As a refresher, here is the beginning of the problem transaction

>>>>>>>>> which

>>>>>>>>> will result in an inode with 0 links.   There are actually three

>>>>>>>>> inodes

>>>>>>>>> in

>>>>>>>>> this transaction which will all end up having 0 links, inodes

>>>>>>>>>109,

>>>>>>>>> 102,

>>>>>>>>> and 113.

>>>>>>>>>

>>>>>>>>> 2016-07-13 16:03:03.929679 7fb6617c1800 10 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> txn(seq 77752 len 35269 crc 1796157826)

>>>>>>>>> 2016-07-13 16:03:03.929681 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_dir_unlink  db/000154.log

>>>>>>>>> 2016-07-13 16:03:03.929683 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_file_remove 109

>>>>>>>>> 2016-07-13 16:03:03.929684 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_dir_unlink  db/000153.log

>>>>>>>>> 2016-07-13 16:03:03.929685 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_file_remove 102

>>>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_dir_unlink  db/000151.log

>>>>>>>>> 2016-07-13 16:03:03.929686 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_file_remove 113

>>>>>>>>> 2016-07-13 16:03:03.929687 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_dir_unlink  db/000150.log

>>>>>>>>> 2016-07-13 16:03:03.929688 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_file_remove 111

>>>>>>>>> 2016-07-13 16:03:03.929689 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_file_update  file(ino 113 size 71354938 mtime 2016-07-12

>>>>>>>>> 17:05:53.738683 bdev 1 extents

>>>>>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])

>>>>>>>>> 2016-07-13 16:03:03.929693 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>>> 113 =

>>>>>>>>> 0x7fb66aafc880 (new)

>>>>>>>> When _get_file() is called with a inode number, if the file is not

>>>>>>>> found

>>>>>>>> in file_map

>>>>>>>> a new file is created and returned, hence that (new) in the log

>>>>>>>> file.

>>>>>>>> These set of events

>>>>>>>> can happen in case of rename. You can see the same

>>>>>>>>BlueFS::rename().

>>>>>>>> In

>>>>>>>> that case, we use the

>>>>>>>> same inode number.

>>>>>>> I'm not sure I understand, I do not believe this is a rename case.

>>>>>>> For

>>>>>>> this to be a rename, there would have to be at least one

>>>>>>> op_dir_link()

>>>>>>> call for each op_dir_unlink() call in the sequence above.   The

>>>>>>>only

>>>>>>> code

>>>>>>> sequence I can find where an op_dir_unlink() and an

>>>>>>>op_file_remove()

>>>>>>> appear together without a corresponding op_dir_link() is when a

>>>>>>>file

>>>>>>> is

>>>>>>> being deleted.

>>>>>>>

>>>>>>> The (new) above makes my point exactly.  The op_file_remove() call

>>>>>>> above

>>>>>>> freed the block allocation for the inode and released the inode

>>>>>>>from

>>>>>>> the

>>>>>>> filemap, showing that the file->refs had to be zero at that time

>>>>>>> (routine

>>>>>>> _drop_link).   Yet the following op_file_update() call is

>>>>>>>performing

>>>>>>> a

>>>>>>> truncate on that same inode, which is not in the filemap.   So the

>>>>>>> caller

>>>>>>> has to be holding a file structure whose refs field is greater than

>>>>>>> zero.

>>>>>>> Which implies that the caller's file structure has to be stale.

>>>>>>>The

>>>>>>> fact

>>>>>>> that the op_file_update file contents for inode 113 is exactly the

>>>>>>> same

>>>>>>> as

>>>>>>> the previous op_file_update contents prior to the op_file_remove(),

>>>>>>> except

>>>>>>> for a reduction in file size, proves I believe that the file struct

>>>>>>> used

>>>>>>> by the op_file_update() caller is stale.

>>>>>>>

>>>>>>>

>>>>>>>>> 2016-07-13 16:03:03.929695 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_file_update  file(ino 102 size 68557043 mtime 2016-07-12

>>>>>>>>> 17:05:54.615190 bdev 1 extents

>>>>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])

>>>>>>>>> 2016-07-13 16:03:03.929702 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>>> 102 =

>>>>>>>>> 0x7fb66aafbb90 (new)

>>>>>>>>> 2016-07-13 16:03:03.929703 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5be000:

>>>>>>>>> op_file_update  file(ino 109 size 64421913 mtime 2016-07-12

>>>>>>>>> 22:41:33.028030 bdev 1 extents

>>>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])

>>>>>>>>> 2016-07-13 16:03:03.929705 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>>> 109 =

>>>>>>>>> 0x7fb66aafbe60 (new)

>>>>>>>>> ...

>>>>>>>>>

>>>>>>>>>

>>>>>>>>> The prior transaction log entry for each of the three inodes was

>>>>>>>>> another

>>>>>>>>> op_file_update() call.  The only difference between the two

>>>>>>>>>updates

>>>>>>>>> for

>>>>>>>>> ino 133 was file size went from 71359362 -> 71354938, 4424 less

>>>>>>>>> bytes,

>>>>>>>>> all

>>>>>>>>> other fields are identical.

>>>>>>>> Are the extents same for both the events?

>>>>>>> Yes, please compare this line with the one higher above in this

>>>>>>> email,

>>>>>>> you

>>>>>>> can see that only the file size is different.   That is true for

>>>>>>>all

>>>>>>> three

>>>>>>> of these inodes.

>>>>>>>

>>>>>>>>> 2016-07-13 16:03:03.925186 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5b4000:

>>>>>>>>> op_file_update  file(ino 113 size 71359362 mtime 2016-07-12

>>>>>>>>> 17:05:53.738683 bdev 1 extents

>>>>>>>>> [1:904921088+1048576,1:910163968+2097152,1:916455424+2097152,1:931135488+2097152,1:937426944+2097152,1:943718400+2097152,1:950009856+2097152,1:960495616+58720256])

>>>>>>>>> 2016-07-13 16:03:03.925197 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>>> 113 =

>>>>>>>>> 0x7fb66aafbb90

>>>>>>>>>

>>>>>>>>>

>>>>>>>>> Same story for inode 109, the file size changed from 71354939 ->

>>>>>>>>> 64421913,

>>>>>>>>> 6933026 less bytes, all other fields are identical.

>>>>>>>>>

>>>>>>>>> 2016-07-13 16:03:03.926101 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5ba000:

>>>>>>>>> op_file_update  file(ino 109 size 71354939 mtime 2016-07-12

>>>>>>>>> 22:41:33.028030 bdev 1 extents

>>>>>>>>> [1:512753664+31457280,1:545259520+4194304,1:550502400+36700160])

>>>>>>>>> 2016-07-13 16:03:03.926103 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>>> 109 =

>>>>>>>>> 0x7fb66aafc910

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>> Same story for inode 102, the file size changed from 71359405 ->

>>>>>>>>> 68557043,

>>>>>>>>> 2802362 less bytes, all other fields are identical.

>>>>>>>>>

>>>>>>>>> 2016-07-13 16:03:03.926040 7fb6617c1800 20 bluefs _replay

>>>>>>>>>0x5b9000:

>>>>>>>>> op_file_update  file(ino 102 size 71359405 mtime 2016-07-12

>>>>>>>>> 17:05:54.615190 bdev 1 extents

>>>>>>>>> [1:78643200+1048576,1:191889408+1048576,1:197132288+1048576,1:202375168+2097152,1:208666624+2097152,1:214958080+2097152,1:221249536+2097152,1:227540992+1048576,1:232783872+2097152,1:239075328+2097152,1:245366784+2097152,1:251658240+2097152,1:257949696+1048576,1:263192576+2097152,1:269484032+2097152,1:275775488+2097152,1:282066944+2097152,1:288358400+2097152,1:294649856+1048576,1:299892736+2097152,1:306184192+2097152,1:312475648+2097152,1:318767104+1048576,1:324009984+2097152,1:330301440+2097152,1:336592896+2097152,1:342884352+2097152,1:349175808+1048576,1:354418688+2097152,1:360710144+2097152,1:367001600+1048576,1:372244480+1048576,1:377487360+2097152,1:383778816+1048576,1:389021696+1048576,1:394264576+2097152,1:400556032+1048576,1:405798912+3145728,1:507510784+5242880])

>>>>>>>>> 2016-07-13 16:03:03.926048 7fb6617c1800 30 bluefs _get_file ino

>>>>>>>>> 102 =

>>>>>>>>> 0x7fb66aafbe60

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>> There is only one caller of op_file_update() in which the size of

>>>>>>>>> the

>>>>>>>>> file

>>>>>>>>> is reduced but all other fields stay the same, that is

>>>>>>>>> BlueFS::_truncate(), which in turn is only called by

>>>>>>>>> BlueFS::truncate(),

>>>>>>>>> which is only called by BlueRocksWritableFile::Close().

>>>>>>>>>

>>>>>>>>> The op_dir_unlink() and op_file_remove() calls for each of the

>>>>>>>>> three

>>>>>>>>> inodes seem to be coming from BlueRocksEnv::DeleteFile(), but I'm

>>>>>>>>> having

>>>>>>>>> difficulty finding where the call is coming from.

>>>>>>>> Any syscall making dir rename or file remove or directory remove

>>>>>>>> will

>>>>>>>> result in BlueFS::rename or unlink

>>>>>>>> or corresponding calls and results in op_dir_* or op_file_*

>>>>>>>> operations.

>>>>>>>> rocksdb wal_manager or compaction

>>>>>>>> can trigger these calls during that process. Will take a closer

>>>>>>>>look

>>>>>>>> at

>>>>>>>> the logs, if i can get something from them.

>>>>>>>>> So the problem appears to be that the file is first deleted, then

>>>>>>>>> it

>>>>>>>>> is

>>>>>>>>> closed, and a byproduct of the close operation  on a stale fnode

>>>>>>>>>is

>>>>>>>>> that

>>>>>>>>> another op_file_update() entry gets logged, recreating the

>>>>>>>>>deleted

>>>>>>>>> inode

>>>>>>>>> with a stale fnode, resulting in the 0 links failure.

>>>>>>>> Once the file is deleted, all the underlying extents will be

>>>>>>>> cleared.

>>>>>>>> Until and unless there is race in handling on

>>>>>>>> the files from application level, Bluefs doesn't have any control

>>>>>>>> over

>>>>>>>> them. If the a file(handle) is been reused after

>>>>>>>> deletion, that seems to be a bug. Will take a look at the logs to

>>>>>>>> confirm your theory.

>>>>>>>> But here, this seems to be more of rename use case. one more thing

>>>>>>>> to

>>>>>>>> verify is the extents owned by the inodes mentioned.

>>>>>>> No, I don't believe this is a rename, see above.

>>>>>>>

>>>>>>> I'll note that every occurrence of this problem has been preceded

>>>>>>>by

>>>>>>> an

>>>>>>> OSD segfault.  And this problem is racy, if I increase debug

>>>>>>>logging

>>>>>>> for

>>>>>>> bluefs or bluestore, then the problem goes away.  I have 50+

>>>>>>> bluestore

>>>>>>> segfaults and asserts in my cluster logs; I suspect many could be

>>>>>>>due

>>>>>>> to

>>>>>>> this problem, because the same range of blocks is being freed more

>>>>>>> than

>>>>>>> once.   If some other file should perform an allocation between the

>>>>>>> block-freeing in op_file_remove() and the second freeing in

>>>>>>> op_file_update(), then we could easily have multiple files with the

>>>>>>> same

>>>>>>> block allocation at the same time.

>>>>>>>

>>>>>>> I think this problem can be caught fairly easily: an assert could be
>>>>>>> added to _truncate().  The _truncate() routine is only called for
>>>>>>> files which are open, which means that the inode has to be in the
>>>>>>> filemap.  Code could be added at the front of _truncate() to look up
>>>>>>> the inode in the filemap, and if it is not there, then assert.  That
>>>>>>> would provide a stack trace showing the caller using the file struct
>>>>>>> with the stale reference count.  (It may be a few days before I have a
>>>>>>> chance to try this myself.)

>>>>>> Sure. Could you please let me know the steps to hit the issue? I am
>>>>>> only doing a few short runs to check performance and some unit test
>>>>>> cases. I will add the necessary asserts if I can reproduce the issue,
>>>>>> and fix it.

>>>>> Thanks for the help!  I will have to get back to you on Monday, as there
>>>>> is another person on our team who runs the benchmarks that trigger this
>>>>> problem; I don't know how to run his benchmarks myself.

>>>>>

>>>>> Kevan

>>>>>

>>>>>> Varada

>>>>>>> Regards, Kevan

>>>>>>>

>>>>>>>

>>>>>>>


diff mbox

Patch

diff --git a/src/os/bluestore/BlueFS.cc b/src/os/bluestore/BlueFS.cc
index dbfd59b..759d1af 100644
--- a/src/os/bluestore/BlueFS.cc
+++ b/src/os/bluestore/BlueFS.cc
@@ -1299,6 +1299,13 @@ int BlueFS::_truncate(FileWriter *h, uint64_t offset)
   dout(10) << __func__ << " 0x" << std::hex << offset << std::dec
            << " file " << h->file->fnode << dendl;
 
+  // If ino is not in filemap, we are truncating a file that is not allocated.
+
+  auto p = file_map.find(h->file->fnode.ino);
+  if (p == file_map.end()) {
+    assert(0 == "_truncate(): truncating a file that is not allocated");
+  }
+
   // truncate off unflushed data?