[082/622] lnet: handle remote errors in LNet

Message ID	1582838290-17243-83-git-send-email-jsimmons@infradead.org (mailing list archive)
State	New, archived
Headers	Return-Path: <SRS0=hXa/=4P=lists.lustre.org=lustre-devel-bounces@kernel.org>
	DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E710F246A0
	From: James Simmons <jsimmons@infradead.org>
	To: Andreas Dilger <adilger@whamcloud.com>, Oleg Drokin <green@whamcloud.com>, NeilBrown <neilb@suse.de>
	Date: Thu, 27 Feb 2020 16:09:10 -0500
	Message-Id: <1582838290-17243-83-git-send-email-jsimmons@infradead.org>
	In-Reply-To: <1582838290-17243-1-git-send-email-jsimmons@infradead.org>
	References: <1582838290-17243-1-git-send-email-jsimmons@infradead.org>
	Subject: [lustre-devel] [PATCH 082/622] lnet: handle remote errors in LNet
	Precedence: list
	Cc: Amir Shehata <ashehata@whamcloud.com>, Lustre Development List <lustre-devel@lists.lustre.org>
	MIME-Version: 1.0
	Content-Type: text/plain; charset="us-ascii"
	Content-Transfer-Encoding: 7bit
	Errors-To: lustre-devel-bounces@lists.lustre.org
	Sender: "lustre-devel" <lustre-devel-bounces@lists.lustre.org>
Series	lustre: sync closely to 2.13.52

Message ID

1582838290-17243-83-git-send-email-jsimmons@infradead.org (mailing list archive)

State

New, archived

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E710F246A0
From: James Simmons <jsimmons@infradead.org>
To: Andreas Dilger <adilger@whamcloud.com>, Oleg Drokin <green@whamcloud.com>,
 NeilBrown <neilb@suse.de>
Date: Thu, 27 Feb 2020 16:09:10 -0500
Message-Id: <1582838290-17243-83-git-send-email-jsimmons@infradead.org>
In-Reply-To: <1582838290-17243-1-git-send-email-jsimmons@infradead.org>
References: <1582838290-17243-1-git-send-email-jsimmons@infradead.org>
Subject: [lustre-devel] [PATCH 082/622] lnet: handle remote errors in LNet
Precedence: list
Cc: Amir Shehata <ashehata@whamcloud.com>,
 Lustre Development List <lustre-devel@lists.lustre.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: lustre-devel-bounces@lists.lustre.org
Sender: "lustre-devel" <lustre-devel-bounces@lists.lustre.org>

Series

lustre: sync closely to 2.13.52

Commit Message

James Simmons Feb. 27, 2020, 9:09 p.m. UTC

From: Amir Shehata <ashehata@whamcloud.com>

Add a health value to the peer NI structure. Decrement the
value whenever there is an error sending to the peer.
Modify the selection algorithm to consider the peer NI health
value when selecting the best peer NI to send to.
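
As a rough illustration of the selection rule described above, the sketch
below picks the healthiest entry from an array and only falls back to a
secondary criterion when health values are equal. The struct, the function
name and the use of transmit credits as the only tie-breaker are
assumptions made for this example; the real code in lib-move.c also
considers preferred NIDs and sequence numbers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct lnet_peer_ni. */
struct peer_ni_sketch {
	const char *nid;	/* printable identifier */
	int healthv;		/* higher means healthier */
	int txcredits;		/* tie-breaker used in this sketch only */
};

/* Return the index of the healthiest entry. Among entries with equal
 * health, prefer the one with more transmit credits. The earliest
 * entry wins a full tie.
 */
static size_t pick_best_peer_ni(const struct peer_ni_sketch *nis, size_t n)
{
	size_t best = 0;
	size_t i;

	assert(nis && n > 0);

	for (i = 1; i < n; i++) {
		/* a less healthy candidate is never selected */
		if (nis[i].healthv < nis[best].healthv)
			continue;
		/* healthier wins outright; equal health compares credits */
		if (nis[i].healthv > nis[best].healthv ||
		    nis[i].txcredits > nis[best].txcredits)
			best = i;
	}
	return best;
}
```

Checking health first means a peer NI with many free credits but a
reduced health value is still passed over while a fully healthy
alternative exists.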

Put the peer NI on the recovery queue whenever there is
an error sending to it. Attempt a resend only on
REMOTE_DROPPED, since in that case we are sure the message
was never received by the peer. For all other remote errors,
finalize the message.
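
The policy in the paragraph above can be pictured with the following
userspace sketch. All names are hypothetical, the ceiling of 1000 is an
assumed stand-in for LNET_MAX_HEALTH_VALUE, and the decrement step
("sensitivity") with a floor of zero is an assumption about how the
health value is lowered; locking and the real recovery list are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed ceiling, standing in for LNET_MAX_HEALTH_VALUE. */
#define SKETCH_MAX_HEALTH 1000

enum sketch_hstatus {
	SKETCH_OK,
	SKETCH_REMOTE_DROPPED,
	SKETCH_REMOTE_ERROR,
	SKETCH_REMOTE_TIMEOUT,
	SKETCH_NETWORK_TIMEOUT,
};

enum sketch_action { SKETCH_FINALIZE, SKETCH_RESEND };

/* Hypothetical peer NI: a health value plus a recovery-queue flag. */
struct sketch_peer_ni {
	int healthv;
	bool on_recovery_queue;
};

static enum sketch_action
sketch_health_check(struct sketch_peer_ni *lpni, enum sketch_hstatus st,
		    int sensitivity)
{
	assert(lpni && sensitivity >= 0);

	if (st == SKETCH_OK) {
		/* success nudges health back up, capped at the maximum */
		if (lpni->healthv < SKETCH_MAX_HEALTH)
			lpni->healthv++;
		return SKETCH_FINALIZE;
	}

	/* every remote failure lowers health (floor of 0 is assumed) */
	lpni->healthv -= sensitivity;
	if (lpni->healthv < 0)
		lpni->healthv = 0;

	/* queue for recovery only if health actually dropped below max */
	if (lpni->healthv < SKETCH_MAX_HEALTH)
		lpni->on_recovery_queue = true;

	/* only a drop guarantees the peer never saw the message */
	return st == SKETCH_REMOTE_DROPPED ? SKETCH_RESEND : SKETCH_FINALIZE;
}
```

With a sensitivity of 0 the health value stays at the maximum, so the
peer NI is not queued for recovery, which matches the comment in
lnet_handle_remote_failure().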

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 76fad19c2dea ("LU-9120 lnet: handle remote errors in LNet")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/32767
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Sonia Sharma <sharmaso@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 include/linux/lnet/lib-lnet.h  |   6 +
 include/linux/lnet/lib-types.h |  12 ++
 net/lnet/lnet/api-ni.c         |   1 +
 net/lnet/lnet/lib-move.c       | 311 +++++++++++++++++++++++++++++++++++------
 net/lnet/lnet/lib-msg.c        |  87 ++++++++++--
 net/lnet/lnet/peer.c           |   9 ++
 6 files changed, 368 insertions(+), 58 deletions(-)
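
Note on lnet_inc_healthv(), which this patch moves into lib-lnet.h: it
relies on atomic_add_unless() to stop incrementing once the value reaches
LNET_MAX_HEALTH_VALUE. A userspace approximation using C11 atomics might
look like the sketch below; the function names are hypothetical and this
is not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Add 'a' to '*v' unless '*v' currently equals 'u'.
 * Returns true if the addition was performed.
 */
static bool sketch_add_unless(atomic_int *v, int a, int u)
{
	int cur;

	assert(v);
	cur = atomic_load(v);

	while (cur != u) {
		/* on failure, 'cur' is reloaded with the current value */
		if (atomic_compare_exchange_weak(v, &cur, cur + a))
			return true;
	}
	return false;
}

/* Saturating increment: a no-op once the value has reached 'max'. */
static void sketch_inc_healthv(atomic_int *healthv, int max)
{
	sketch_add_unless(healthv, 1, max);
}
```

The compare-and-exchange loop keeps the check and the increment atomic
with respect to each other, so concurrent callers cannot push the value
past the ceiling.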

diff --git a/include/linux/lnet/lib-lnet.h b/include/linux/lnet/lib-lnet.h
index 965fc5f..b8ca114 100644
--- a/include/linux/lnet/lib-lnet.h
+++ b/include/linux/lnet/lib-lnet.h
@@ -894,6 +894,12 @@  int lnet_get_peer_ni_info(u32 peer_index, u64 *nid,
 	return false;
 }
 
+static inline void
+lnet_inc_healthv(atomic_t *healthv)
+{
+	atomic_add_unless(healthv, 1, LNET_MAX_HEALTH_VALUE);
+}
+
 void lnet_incr_stats(struct lnet_element_stats *stats,
 		     enum lnet_msg_type msg_type,
 		     enum lnet_stats_type stats_type);
diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index 8c3bf34..19b83a4 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -478,6 +478,8 @@  struct lnet_peer_ni {
 	struct list_head	 lpni_peer_nis;
 	/* chain on remote peer list */
 	struct list_head	 lpni_on_remote_peer_ni_list;
+	/* chain on recovery queue */
+	struct list_head	 lpni_recovery;
 	/* chain on peer hash */
 	struct list_head	 lpni_hashlist;
 	/* messages blocking for tx credits */
@@ -529,6 +531,10 @@  struct lnet_peer_ni {
 	lnet_nid_t		 lpni_nid;
 	/* # refs */
 	atomic_t		 lpni_refcount;
+	/* health value for the peer */
+	atomic_t		 lpni_healthv;
+	/* recovery ping mdh */
+	struct lnet_handle_md	 lpni_recovery_ping_mdh;
 	/* CPT this peer attached on */
 	int			 lpni_cpt;
 	/* state flags -- protected by lpni_lock */
@@ -558,6 +564,10 @@  struct lnet_peer_ni {
 
 /* Preferred path added due to traffic on non-MR peer_ni */
 #define LNET_PEER_NI_NON_MR_PREF	BIT(0)
+/* peer is being recovered. */
+#define LNET_PEER_NI_RECOVERY_PENDING	BIT(1)
+/* peer is being deleted */
+#define LNET_PEER_NI_DELETING		BIT(2)
 
 struct lnet_peer {
 	/* chain on pt_peer_list */
@@ -1088,6 +1098,8 @@  struct lnet {
 	struct list_head		**ln_mt_resendqs;
 	/* local NIs to recover */
 	struct list_head		ln_mt_localNIRecovq;
+	/* peer NIs to recover */
+	struct list_head		ln_mt_peerNIRecovq;
 	/* recovery eq handler */
 	struct lnet_handle_eq		ln_mt_eqh;
 
diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c
index deef404..97d9be5 100644
--- a/net/lnet/lnet/api-ni.c
+++ b/net/lnet/lnet/api-ni.c
@@ -832,6 +832,7 @@  struct lnet_libhandle *
 	INIT_LIST_HEAD(&the_lnet.ln_dc_working);
 	INIT_LIST_HEAD(&the_lnet.ln_dc_expired);
 	INIT_LIST_HEAD(&the_lnet.ln_mt_localNIRecovq);
+	INIT_LIST_HEAD(&the_lnet.ln_mt_peerNIRecovq);
 	init_waitqueue_head(&the_lnet.ln_dc_waitq);
 
 	rc = lnet_descriptor_setup();
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index f3f4b84..5224490 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1025,15 +1025,6 @@  void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	}
 
 	if (txpeer) {
-		/*
-		 * TODO:
-		 * Once the patch for the health comes in we need to set
-		 * the health of the peer ni to bad when we fail to send
-		 * a message.
-		 * int status = msg->msg_ev.status;
-		 * if (status != 0)
-		 *	lnet_set_peer_ni_health_locked(txpeer, false)
-		 */
 		msg->msg_txpeer = NULL;
 		lnet_peer_ni_decref_locked(txpeer);
 	}
@@ -1545,6 +1536,8 @@  void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 	int best_lpni_credits = INT_MIN;
 	bool preferred = false;
 	bool ni_is_pref;
+	int best_lpni_healthv = 0;
+	int lpni_healthv;
 
 	while ((lpni = lnet_get_next_peer_ni_locked(peer, peer_net, lpni))) {
 		/* if the best_ni we've chosen aleady has this lpni
@@ -1553,6 +1546,8 @@  void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		ni_is_pref = lnet_peer_is_pref_nid_locked(lpni,
 							  best_ni->ni_nid);
 
+		lpni_healthv = atomic_read(&lpni->lpni_healthv);
+
 		CDEBUG(D_NET, "%s ni_is_pref = %d\n",
 		       libcfs_nid2str(best_ni->ni_nid), ni_is_pref);
 
@@ -1562,8 +1557,13 @@  void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 			       lpni->lpni_txcredits, best_lpni_credits,
 			       lpni->lpni_seq, best_lpni->lpni_seq);
 
+		/* pick the healthiest peer ni */
+		if (lpni_healthv < best_lpni_healthv) {
+			continue;
+		} else if (lpni_healthv > best_lpni_healthv) {
+			best_lpni_healthv = lpni_healthv;
 		/* if this is a preferred peer use it */
-		if (!preferred && ni_is_pref) {
+		} else if (!preferred && ni_is_pref) {
 			preferred = true;
 		} else if (preferred && !ni_is_pref) {
 			/*
@@ -2408,6 +2408,16 @@  struct lnet_ni *
 	return 0;
 }
 
+enum lnet_mt_event_type {
+	MT_TYPE_LOCAL_NI = 0,
+	MT_TYPE_PEER_NI
+};
+
+struct lnet_mt_event_info {
+	enum lnet_mt_event_type mt_type;
+	lnet_nid_t mt_nid;
+};
+
 static void
 lnet_resend_pending_msgs_locked(struct list_head *resendq, int cpt)
 {
@@ -2503,6 +2513,7 @@  struct lnet_ni *
 static void
 lnet_recover_local_nis(void)
 {
+	struct lnet_mt_event_info *ev_info;
 	struct list_head processed_list;
 	struct list_head local_queue;
 	struct lnet_handle_md mdh;
@@ -2550,15 +2561,24 @@  struct lnet_ni *
 		lnet_ni_unlock(ni);
 		lnet_net_unlock(0);
 
-		/* protect the ni->ni_state field. Once we call the
-		 * lnet_send_ping function it's possible we receive
-		 * a response before we check the rc. The lock ensures
-		 * a stable value for the ni_state RECOVERY_PENDING bit
-		 */
+		CDEBUG(D_NET, "attempting to recover local ni: %s\n",
+		       libcfs_nid2str(ni->ni_nid));
+
 		lnet_ni_lock(ni);
 		if (!(ni->ni_state & LNET_NI_STATE_RECOVERY_PENDING)) {
 			ni->ni_state |= LNET_NI_STATE_RECOVERY_PENDING;
 			lnet_ni_unlock(ni);
+
+			ev_info = kzalloc(sizeof(*ev_info), GFP_NOFS);
+			if (!ev_info) {
+				CERROR("out of memory. Can't recover %s\n",
+				       libcfs_nid2str(ni->ni_nid));
+				lnet_ni_lock(ni);
+				ni->ni_state &= ~LNET_NI_STATE_RECOVERY_PENDING;
+				lnet_ni_unlock(ni);
+				continue;
+			}
+
 			mdh = ni->ni_ping_mdh;
 			/* Invalidate the ni mdh in case it's deleted.
 			 * We'll unlink the mdh in this case below.
@@ -2587,9 +2607,10 @@  struct lnet_ni *
 			lnet_ni_decref_locked(ni, 0);
 			lnet_net_unlock(0);
 
-			rc = lnet_send_ping(nid, &mdh,
-					    LNET_INTERFACES_MIN, (void *)nid,
-					    the_lnet.ln_mt_eqh, true);
+			ev_info->mt_type = MT_TYPE_LOCAL_NI;
+			ev_info->mt_nid = nid;
+			rc = lnet_send_ping(nid, &mdh, LNET_INTERFACES_MIN,
+					    ev_info, the_lnet.ln_mt_eqh, true);
 			/* lookup the nid again */
 			lnet_net_lock(0);
 			ni = lnet_nid2ni_locked(nid, 0);
@@ -2694,6 +2715,44 @@  struct lnet_ni *
 }
 
 static void
+lnet_unlink_lpni_recovery_mdh_locked(struct lnet_peer_ni *lpni, int cpt)
+{
+	struct lnet_handle_md recovery_mdh;
+
+	LNetInvalidateMDHandle(&recovery_mdh);
+
+	if (lpni->lpni_state & LNET_PEER_NI_RECOVERY_PENDING) {
+		recovery_mdh = lpni->lpni_recovery_ping_mdh;
+		LNetInvalidateMDHandle(&lpni->lpni_recovery_ping_mdh);
+	}
+	spin_unlock(&lpni->lpni_lock);
+	lnet_net_unlock(cpt);
+	if (!LNetMDHandleIsInvalid(recovery_mdh))
+		LNetMDUnlink(recovery_mdh);
+	lnet_net_lock(cpt);
+	spin_lock(&lpni->lpni_lock);
+}
+
+static void
+lnet_clean_peer_ni_recoveryq(void)
+{
+	struct lnet_peer_ni *lpni, *tmp;
+
+	lnet_net_lock(LNET_LOCK_EX);
+
+	list_for_each_entry_safe(lpni, tmp, &the_lnet.ln_mt_peerNIRecovq,
+				 lpni_recovery) {
+		list_del_init(&lpni->lpni_recovery);
+		spin_lock(&lpni->lpni_lock);
+		lnet_unlink_lpni_recovery_mdh_locked(lpni, LNET_LOCK_EX);
+		spin_unlock(&lpni->lpni_lock);
+		lnet_peer_ni_decref_locked(lpni);
+	}
+
+	lnet_net_unlock(LNET_LOCK_EX);
+}
+
+static void
 lnet_clean_resendqs(void)
 {
 	struct lnet_msg *msg, *tmp;
@@ -2716,6 +2775,128 @@  struct lnet_ni *
 	cfs_percpt_free(the_lnet.ln_mt_resendqs);
 }
 
+static void
+lnet_recover_peer_nis(void)
+{
+	struct lnet_mt_event_info *ev_info;
+	struct list_head processed_list;
+	struct list_head local_queue;
+	struct lnet_handle_md mdh;
+	struct lnet_peer_ni *lpni;
+	struct lnet_peer_ni *tmp;
+	lnet_nid_t nid;
+	int healthv;
+	int rc;
+
+	INIT_LIST_HEAD(&local_queue);
+	INIT_LIST_HEAD(&processed_list);
+
+	/* Always use cpt 0 for locking across all interactions with
+	 * ln_mt_peerNIRecovq
+	 */
+	lnet_net_lock(0);
+	list_splice_init(&the_lnet.ln_mt_peerNIRecovq,
+			 &local_queue);
+	lnet_net_unlock(0);
+
+	list_for_each_entry_safe(lpni, tmp, &local_queue,
+				 lpni_recovery) {
+		/* The same protection strategy is used here as is in the
+		 * local recovery case.
+		 */
+		lnet_net_lock(0);
+		healthv = atomic_read(&lpni->lpni_healthv);
+		spin_lock(&lpni->lpni_lock);
+		if (lpni->lpni_state & LNET_PEER_NI_DELETING ||
+		    healthv == LNET_MAX_HEALTH_VALUE) {
+			list_del_init(&lpni->lpni_recovery);
+			lnet_unlink_lpni_recovery_mdh_locked(lpni, 0);
+			spin_unlock(&lpni->lpni_lock);
+			lnet_peer_ni_decref_locked(lpni);
+			lnet_net_unlock(0);
+			continue;
+		}
+		spin_unlock(&lpni->lpni_lock);
+		lnet_net_unlock(0);
+
+		/* NOTE: we're racing with peer deletion from user space.
+		 * It's possible that a peer is deleted after we check its
+		 * state. In this case the recovery can create a new peer
+		 */
+		spin_lock(&lpni->lpni_lock);
+		if (!(lpni->lpni_state & LNET_PEER_NI_RECOVERY_PENDING) &&
+		    !(lpni->lpni_state & LNET_PEER_NI_DELETING)) {
+			lpni->lpni_state |= LNET_PEER_NI_RECOVERY_PENDING;
+			spin_unlock(&lpni->lpni_lock);
+
+			ev_info = kzalloc(sizeof(*ev_info), GFP_NOFS);
+			if (!ev_info) {
+				CERROR("out of memory. Can't recover %s\n",
+				       libcfs_nid2str(lpni->lpni_nid));
+				spin_lock(&lpni->lpni_lock);
+				lpni->lpni_state &=
+					~LNET_PEER_NI_RECOVERY_PENDING;
+				spin_unlock(&lpni->lpni_lock);
+				continue;
+			}
+
+			/* look at the comments in lnet_recover_local_nis() */
+			mdh = lpni->lpni_recovery_ping_mdh;
+			LNetInvalidateMDHandle(&lpni->lpni_recovery_ping_mdh);
+			nid = lpni->lpni_nid;
+			lnet_net_lock(0);
+			list_del_init(&lpni->lpni_recovery);
+			lnet_peer_ni_decref_locked(lpni);
+			lnet_net_unlock(0);
+
+			ev_info->mt_type = MT_TYPE_PEER_NI;
+			ev_info->mt_nid = nid;
+			rc = lnet_send_ping(nid, &mdh, LNET_INTERFACES_MIN,
+					    ev_info, the_lnet.ln_mt_eqh, true);
+			lnet_net_lock(0);
+			/* lnet_find_peer_ni_locked() grabs a refcount for
+			 * us. No need to take it explicitly.
+			 */
+			lpni = lnet_find_peer_ni_locked(nid);
+			if (!lpni) {
+				lnet_net_unlock(0);
+				LNetMDUnlink(mdh);
+				continue;
+			}
+
+			lpni->lpni_recovery_ping_mdh = mdh;
+			/* While we're unlocked the lpni could've been
+			 * readded on the recovery queue. In this case we
+			 * don't need to add it to the local queue, since
+			 * it's already on there and the thread that added
+			 * it would've incremented the refcount on the
+			 * peer, which means we need to decref the refcount
+			 * that was implicitly grabbed by find_peer_ni_locked.
+			 * Otherwise, if the lpni is still not on
+			 * the recovery queue, then we'll add it to the
+			 * processed list.
+			 */
+			if (list_empty(&lpni->lpni_recovery))
+				list_add_tail(&lpni->lpni_recovery,
+					      &processed_list);
+			else
+				lnet_peer_ni_decref_locked(lpni);
+			lnet_net_unlock(0);
+
+			spin_lock(&lpni->lpni_lock);
+			if (rc)
+				lpni->lpni_state &=
+					~LNET_PEER_NI_RECOVERY_PENDING;
+		}
+		spin_unlock(&lpni->lpni_lock);
+	}
+
+	list_splice_init(&processed_list, &local_queue);
+	lnet_net_lock(0);
+	list_splice(&local_queue, &the_lnet.ln_mt_peerNIRecovq);
+	lnet_net_unlock(0);
+}
+
 static int
 lnet_monitor_thread(void *arg)
 {
@@ -2736,6 +2917,8 @@  struct lnet_ni *
 
 		lnet_recover_local_nis();
 
+		lnet_recover_peer_nis();
+
 		/* TODO do we need to check if we should sleep without
 		 * timeout?  Technically, an active system will always
 		 * have messages in flight so this check will always
@@ -2822,10 +3005,61 @@  struct lnet_ni *
 }
 
 static void
+lnet_handle_recovery_reply(struct lnet_mt_event_info *ev_info,
+			   int status)
+{
+	lnet_nid_t nid = ev_info->mt_nid;
+
+	if (ev_info->mt_type == MT_TYPE_LOCAL_NI) {
+		struct lnet_ni *ni;
+
+		lnet_net_lock(0);
+		ni = lnet_nid2ni_locked(nid, 0);
+		if (!ni) {
+			lnet_net_unlock(0);
+			return;
+		}
+		lnet_ni_lock(ni);
+		ni->ni_state &= ~LNET_NI_STATE_RECOVERY_PENDING;
+		lnet_ni_unlock(ni);
+		lnet_net_unlock(0);
+
+		if (status != 0) {
+			CERROR("local NI recovery failed with %d\n", status);
+			return;
+		}
+		/* need to increment healthv for the ni here, because in
+		 * the lnet_finalize() path we don't have access to this
+		 * NI. And in order to get access to it, we'll need to
+		 * carry forward too much information.
+		 * In the peer case, it'll naturally be incremented
+		 */
+		lnet_inc_healthv(&ni->ni_healthv);
+	} else {
+		struct lnet_peer_ni *lpni;
+		int cpt;
+
+		cpt = lnet_net_lock_current();
+		lpni = lnet_find_peer_ni_locked(nid);
+		if (!lpni) {
+			lnet_net_unlock(cpt);
+			return;
+		}
+		spin_lock(&lpni->lpni_lock);
+		lpni->lpni_state &= ~LNET_PEER_NI_RECOVERY_PENDING;
+		spin_unlock(&lpni->lpni_lock);
+		lnet_peer_ni_decref_locked(lpni);
+		lnet_net_unlock(cpt);
+
+		if (status != 0)
+			CERROR("peer NI recovery failed with %d\n", status);
+	}
+}
+
+static void
 lnet_mt_event_handler(struct lnet_event *event)
 {
-	lnet_nid_t nid = (lnet_nid_t)event->md.user_ptr;
-	struct lnet_ni *ni;
+	struct lnet_mt_event_info *ev_info = event->md.user_ptr;
 	struct lnet_ping_buffer *pbuf;
 
 	/* TODO: remove assert */
@@ -2837,37 +3071,25 @@  struct lnet_ni *
 	       event->status);
 
 	switch (event->type) {
+	case LNET_EVENT_UNLINK:
+		CDEBUG(D_NET, "%s recovery ping unlinked\n",
+		       libcfs_nid2str(ev_info->mt_nid));
+		/* fall-through */
 	case LNET_EVENT_REPLY:
-		/* If the NI has been restored completely then remove from
-		 * the recovery queue
-		 */
-		lnet_net_lock(0);
-		ni = lnet_nid2ni_locked(nid, 0);
-		if (!ni) {
-			lnet_net_unlock(0);
-			break;
-		}
-		lnet_ni_lock(ni);
-		ni->ni_state &= ~LNET_NI_STATE_RECOVERY_PENDING;
-		lnet_ni_unlock(ni);
-		lnet_net_unlock(0);
+		lnet_handle_recovery_reply(ev_info, event->status);
 		break;
 	case LNET_EVENT_SEND:
 		CDEBUG(D_NET, "%s recovery message sent %s:%d\n",
-		       libcfs_nid2str(nid),
+		       libcfs_nid2str(ev_info->mt_nid),
 		       (event->status) ? "unsuccessfully" :
 		       "successfully", event->status);
 		break;
-	case LNET_EVENT_UNLINK:
-		/* nothing to do */
-		CDEBUG(D_NET, "%s recovery ping unlinked\n",
-		       libcfs_nid2str(nid));
-		break;
 	default:
 		CERROR("Unexpected event: %d\n", event->type);
-		return;
+		break;
 	}
 	if (event->unlinked) {
+		kfree(ev_info);
 		pbuf = LNET_PING_INFO_TO_BUFFER(event->md.start);
 		lnet_ping_buffer_decref(pbuf);
 	}
@@ -2919,14 +3141,16 @@  int lnet_monitor_thr_start(void)
 	lnet_router_cleanup();
 free_mem:
 	the_lnet.ln_mt_state = LNET_MT_STATE_SHUTDOWN;
-	lnet_clean_resendqs();
 	lnet_clean_local_ni_recoveryq();
+	lnet_clean_peer_ni_recoveryq();
+	lnet_clean_resendqs();
 	LNetEQFree(the_lnet.ln_mt_eqh);
 	LNetInvalidateEQHandle(&the_lnet.ln_mt_eqh);
 	return rc;
 clean_queues:
-	lnet_clean_resendqs();
 	lnet_clean_local_ni_recoveryq();
+	lnet_clean_peer_ni_recoveryq();
+	lnet_clean_resendqs();
 	return rc;
 }
 
@@ -2949,8 +3173,9 @@  void lnet_monitor_thr_stop(void)
 
 	/* perform cleanup tasks */
 	lnet_router_cleanup();
-	lnet_clean_resendqs();
 	lnet_clean_local_ni_recoveryq();
+	lnet_clean_peer_ni_recoveryq();
+	lnet_clean_resendqs();
 	rc = LNetEQFree(the_lnet.ln_mt_eqh);
 	LASSERT(rc == 0);
 }
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index e7f7469..046923b 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -482,12 +482,6 @@ 
 	}
 }
 
-static inline void
-lnet_inc_healthv(atomic_t *healthv)
-{
-	atomic_add_unless(healthv, 1, LNET_MAX_HEALTH_VALUE);
-}
-
 static void
 lnet_handle_local_failure(struct lnet_msg *msg)
 {
@@ -524,6 +518,43 @@ 
 	lnet_net_unlock(0);
 }
 
+static void
+lnet_handle_remote_failure(struct lnet_msg *msg)
+{
+	struct lnet_peer_ni *lpni;
+
+	lpni = msg->msg_txpeer;
+
+	/* lpni could be NULL if we're in the LOLND case */
+	if (!lpni)
+		return;
+
+	lnet_net_lock(0);
+	/* the mt could've shutdown and cleaned up the queues */
+	if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) {
+		lnet_net_unlock(0);
+		return;
+	}
+
+	lnet_dec_healthv_locked(&lpni->lpni_healthv);
+	/* add the peer NI to the recovery queue if it's not already there
+	 * and its health value is actually below the maximum. It's
+	 * possible that the sensitivity might be set to 0, and the health
+	 * value will not be reduced. In this case, there is no reason to
+	 * invoke recovery
+	 */
+	if (list_empty(&lpni->lpni_recovery) &&
+	    atomic_read(&lpni->lpni_healthv) < LNET_MAX_HEALTH_VALUE) {
+		CERROR("lpni %s added to recovery queue. Health = %d\n",
+		       libcfs_nid2str(lpni->lpni_nid),
+		       atomic_read(&lpni->lpni_healthv));
+		list_add_tail(&lpni->lpni_recovery,
+			      &the_lnet.ln_mt_peerNIRecovq);
+		lnet_peer_ni_addref_locked(lpni);
+	}
+	lnet_net_unlock(0);
+}
+
 /* Do a health check on the message:
  * return -1 if we're not going to handle the error
  *   success case will return -1 as well
@@ -533,11 +564,20 @@ 
 lnet_health_check(struct lnet_msg *msg)
 {
 	enum lnet_msg_hstatus hstatus = msg->msg_health_status;
+	bool lo = false;
 
 	/* TODO: lnet_incr_hstats(hstatus); */
 
 	LASSERT(msg->msg_txni);
 
+	/* if we're sending to the LOLND then the msg_txpeer will not be
+	 * set. So no need to sanity check it.
+	 */
+	if (LNET_NETTYP(LNET_NIDNET(msg->msg_txni->ni_nid)) != LOLND)
+		LASSERT(msg->msg_txpeer);
+	else
+		lo = true;
+
 	if (hstatus != LNET_MSG_STATUS_OK &&
 	    ktime_compare(ktime_get(), msg->msg_deadline) >= 0)
 		return -1;
@@ -546,9 +586,21 @@ 
 	if (the_lnet.ln_state != LNET_STATE_RUNNING)
 		return -1;
 
+	CDEBUG(D_NET, "health check: %s->%s: %s: %s\n",
+	       libcfs_nid2str(msg->msg_txni->ni_nid),
+	       (lo) ? "self" : libcfs_nid2str(msg->msg_txpeer->lpni_nid),
+	       lnet_msgtyp2str(msg->msg_type),
+	       lnet_health_error2str(hstatus));
+
 	switch (hstatus) {
 	case LNET_MSG_STATUS_OK:
 		lnet_inc_healthv(&msg->msg_txni->ni_healthv);
+		/* It's possible msg_txpeer is NULL in the LOLND
+		 * case.
+		 */
+		if (msg->msg_txpeer)
+			lnet_inc_healthv(&msg->msg_txpeer->lpni_healthv);
+
 		/* we can finalize this message */
 		return -1;
 	case LNET_MSG_STATUS_LOCAL_INTERRUPT:
@@ -560,22 +612,27 @@ 
 		/* add to the re-send queue */
 		goto resend;
 
-		/* TODO: since the remote dropped the message we can
-		 * attempt a resend safely.
-		 */
-	case LNET_MSG_STATUS_REMOTE_DROPPED:
-		break;
-
-		/* These errors will not trigger a resend so simply
-		 * finalize the message
-		 */
+	/* These errors will not trigger a resend so simply
+	 * finalize the message
+	 */
 	case LNET_MSG_STATUS_LOCAL_ERROR:
 		lnet_handle_local_failure(msg);
 		return -1;
+
+	/* Since the remote dropped the message, we can
+	 * attempt a resend safely.
+	 */
+	case LNET_MSG_STATUS_REMOTE_DROPPED:
+		lnet_handle_remote_failure(msg);
+		goto resend;
+
 	case LNET_MSG_STATUS_REMOTE_ERROR:
 	case LNET_MSG_STATUS_REMOTE_TIMEOUT:
 	case LNET_MSG_STATUS_NETWORK_TIMEOUT:
+		lnet_handle_remote_failure(msg);
 		return -1;
+	default:
+		LBUG();
 	}
 
 resend:
diff --git a/net/lnet/lnet/peer.c b/net/lnet/lnet/peer.c
index 121876e..4a62f9a 100644
--- a/net/lnet/lnet/peer.c
+++ b/net/lnet/lnet/peer.c
@@ -124,6 +124,7 @@ 
 	INIT_LIST_HEAD(&lpni->lpni_routes);
 	INIT_LIST_HEAD(&lpni->lpni_hashlist);
 	INIT_LIST_HEAD(&lpni->lpni_peer_nis);
+	INIT_LIST_HEAD(&lpni->lpni_recovery);
 	INIT_LIST_HEAD(&lpni->lpni_on_remote_peer_ni_list);
 
 	spin_lock_init(&lpni->lpni_lock);
@@ -133,6 +134,7 @@ 
 	lpni->lpni_ping_feats = LNET_PING_FEAT_INVAL;
 	lpni->lpni_nid = nid;
 	lpni->lpni_cpt = cpt;
+	atomic_set(&lpni->lpni_healthv, LNET_MAX_HEALTH_VALUE);
 	lnet_set_peer_ni_health_locked(lpni, true);
 
 	net = lnet_get_net_locked(LNET_NIDNET(nid));
@@ -331,6 +333,13 @@ 
 	/* remove peer ni from the hash list. */
 	list_del_init(&lpni->lpni_hashlist);
 
+	/* indicate the peer is being deleted so the monitor thread can
+	 * remove it from the recovery queue.
+	 */
+	spin_lock(&lpni->lpni_lock);
+	lpni->lpni_state |= LNET_PEER_NI_DELETING;
+	spin_unlock(&lpni->lpni_lock);
+
 	/* decrement the ref count on the peer table */
 	ptable = the_lnet.ln_peer_tables[lpni->lpni_cpt];
 	LASSERT(atomic_read(&ptable->pt_number) > 0);
