From patchwork Tue Dec 13 01:34:03 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Willem Jan Withagen X-Patchwork-Id: 9471709 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 85EE8607EE for ; Tue, 13 Dec 2016 01:34:14 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 768F22854E for ; Tue, 13 Dec 2016 01:34:14 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 6B42328554; Tue, 13 Dec 2016 01:34:14 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id F2C9728550 for ; Tue, 13 Dec 2016 01:34:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753073AbcLMBeL (ORCPT ); Mon, 12 Dec 2016 20:34:11 -0500 Received: from smtp.digiware.nl ([176.74.240.9]:11428 "EHLO smtp.digiware.nl" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752578AbcLMBeK (ORCPT ); Mon, 12 Dec 2016 20:34:10 -0500 Received: from router.digiware.nl (localhost.digiware.nl [127.0.0.1]) by smtp.digiware.nl (Postfix) with ESMTP id 32514337D6; Tue, 13 Dec 2016 02:34:06 +0100 (CET) X-Virus-Scanned: amavisd-new at digiware.com Received: from smtp.digiware.nl ([127.0.0.1]) by router.digiware.nl (router.digiware.nl [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id hlpDuG7T-hz4; Tue, 13 Dec 2016 02:34:04 +0100 (CET) Received: from [192.168.10.10] (asus [192.168.10.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.digiware.nl (Postfix) with ESMTPSA id 31FB9337D3; Tue, 13 Dec 2016 02:34:04 +0100 (CET) Subject: Re: Mismatching nonce for 'ceph osd.0 tell' To: kefu chai References: <14688c37-068c-0bc6-615d-de2f8f5bfdc6@digiware.nl> <3df7e2ed-eb34-8ba8-8a8d-9d7b66e45315@digiware.nl> <89a9ee77-21db-e955-7bb3-c4adc3bb94bf@digiware.nl> <390636b2-3bcf-a3c0-1d50-9e62e53bbda8@digiware.nl> <822ea89a-d556-9ff1-5302-115ef05890cc@digiware.nl> <1a42b836-94c8-b265-66af-d2b563cc9c7d@digiware.nl> <87ad4e8e-a94d-079b-5c84-fa61849dbffc@digiware.nl> Cc: Gregory Farnum , Haomai Wang , Ceph Development From: Willem Jan Withagen Message-ID: Date: Tue, 13 Dec 2016 02:34:03 +0100 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.5.1 MIME-Version: 1.0 In-Reply-To: <87ad4e8e-a94d-079b-5c84-fa61849dbffc@digiware.nl> Sender: ceph-devel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: ceph-devel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On 13-12-2016 02:02, Willem Jan Withagen wrote: > On 10-12-2016 01:39, Willem Jan Withagen wrote: >> On 9-12-2016 13:02, Willem Jan Withagen wrote: >>> On 9-12-2016 10:22, Willem Jan Withagen wrote: >>>> On 9-12-2016 09:59, kefu chai wrote: >>>>> On Thu, Dec 8, 2016 at 8:30 PM, Willem Jan Withagen wrote: >>>>>> On 8-12-2016 11:03, kefu chai wrote: >>>>>>> On Tue, Oct 4, 2016 at 7:57 PM, Willem Jan Withagen wrote: >>>>>>>> On 3-10-2016 19:50, Gregory Farnum wrote: >>>>>>>>>> Question here is: >>>>>>>>>> If I ask 'ceph osd dump', I'm actually asking ceph-mon. >>>>>>>>>> And cehp-mon has learned this from (crush?)maps being sent to it by >>>>>>>>>> ceph-osd. >>>>>>>>> The monitor has learned about specific IP addresses/nonces/etc via >>>>>>>>> MOSDBoot messages from the OSDs. The crush locations are set via >>>>>>>>> monitor command messages, generally invoked as part of the init >>>>>>>>> scripts. Maps are generated entirely on the monitor. :) >>>>>>>>> >>>>>>>>>> Is there an easy way to debug/monitor the content of what ceph-osd sends >>>>>>>>>> and ceph-mon receives in the maps? >>>>>>>>>> Just to make sure that it is clear where the problem occurs. >>>>>>>>> You should be able to see the info going in and out by bumping the >>>>>>>>> debug levels up — every message's "print" function is invoked when >>>>>>>>> it's sent/received as long as you have "debug ms = 1". It looks like >>>>>>>>> the MOSDBoot message doesn't natively dump its addresses but you can >>>>>>>>> add them easily if you need to. >>>>>>>> Hi Greg, >>>>>>>> >>>>>>>> Thanx for the answer.... >>>>>>>> >>>>>>>> I've got debug_ms already pumped up all the way to 20. >>>>>>>> So I do get to see what addresses are selected during bind. But still >>>>>>>> they do not end up at the MON, and 'ceph osd dump' reports: >>>>>>>> :/0 >>>>>>>> as bind address. >>>>>>>> >>>>>>>> I'm going to add some more debugs to actually see what MOSDBoot is doing.... >>>>>>> there are multiple messengers used by ceph-osd, the one connected by >>>>>>> rados client is the external/public messenger. it is also used by osd >>>>>>> to talk with the monitor. >>>>>>> >>>>>>> the nonce of the external address of an OSD does not change after it's >>>>>>> up: it's always the pid of ceph-osd process. and the (peer) address of >>>>>>> the booting OSD collected by monitor comes from the connection's >>>>>>> peer_addr field, which is set when the monitor accepts the connection >>>>>>> from OSD. see STATE_ACCEPTING_WAIT_BANNER_ADDR case block in >>>>>>> AsyncConnection::_process_connection(). >>>>>>> >>>>>>> but there are chances that an OSD is restarted and fail to bind its >>>>>>> external messenger to the specified the port. in that case, ceph-osd >>>>>>> will try with another port, but keep the nonce the same. but when it >>>>>>> comes to other messengers used by ceph-osd, their nonces increase by >>>>>>> 1000000 every time they rebind. that's why "ceph osd thrash" can >>>>>>> change the nonces of the cluster_addr, heartbeat_back_addr and >>>>>>> heartbeat_front_addr. the PR of >>>>>>> https://github.com/ceph/ceph/pull/11706 actually changes the behavior >>>>>>> of the messengers of these three messengers. and it has nothing to do >>>>>>> with the external messenger to which the ceph cli client is >>>>>>> connecting. >>>>>>> >>>>>>> so you might want to check >>>>>>> 1) how/why the nonce of the messenger in MonClient is 1000000 + $pid >>>>>>> 2) while the nonce of the same messenger is $pid when the ceph cli >>>>>>> connects to it. >>>>>>> >>>>>>> my PR of https://github.com/ceph/ceph/pull/11804 is more of a cleanup. >>>>>>> it avoids setting the nonce before the rebind finishes. and i tried >>>>>>> with your producer on my linux box, no luck =( >>>>>> Right, >>>>>> >>>>>> You gave me a lot of things to think about, and to start figuring out. >>>>>> >>>>>> And you are right that something really bad needs to happen to an OSD to >>>>>> get in this state. But that is what the tests actually do: They just >>>>>> down/up or kill OSDs and restart. >>>>>> >>>>>> And from previous discussions I "learned" that if the process doesn't >>>>>> die but needs to rebind on the port, the OSD stays at the same port but >>>>>> increments the nonce to indicate that it is a fresh connection. And log >>>>> the external messenger should *not* increment its nonce. >>>>> >>>>>> printing actually shows that the code is going thru a rebind. >>>>> and it should *not* go through rebind(). >>>> I have to dig thru the testscript but as far as I can tell just about >>>> all of the daemons are getting reboots in this test. >>>> >>>> So when would I get a rebind? >>>> >>>> I thought it was because I had an OSD incorrectly marked down: >>>> ./src/osd/OSD.cc:7074: << " wrongly marked me down"; >>>> This I found in the logs, and then I got a rebind. >>>> >>>> Wido suggested looking for this message, on my question why my OSDs were >>>> not getting UP after a good hustle with all OSDs and MONs. >>>> >>>> And that is one of the tests in cephtool-test-mon.sh. >>>> right before the 'ceph tell osd.0 version' there are tests like: >>>> ceph osd set noup >>>> ceph osd down 0 >>>> ceph osd dump | grep 'osd.0 down' >>>> ceph osd unset noup >>>> and >>>> ceph osd reweight osd.0 .5 >>>> ceph osd dump | grep ^osd.0 | grep 'weight 0.5' >>>> ceph osd out 0 >>>> ceph osd in 0 >>>> ceph osd dump | grep ^osd.0 | grep 'weight 0.5' >>>> >>>> >>>>>> Now the bad thing is that the Linux and FreeBSD log do comparable things >>>>>> with my (small) change to the setting of addr. And the nonce is indeed >>>>>> incremented, which increment is actually picked up by all ceph components. >>>> So now I have 2 challenges?? >>>> >>>> 1) Find out why I get a rebind, where you think I should not. >>>> For that I'll have to collect all maltreatment that is done in >>>> cephtool-test-mon.sh. And again compare the Linux and FreeBSD logs >>>> to see what is up. >>>> 2) If we get a rebind... >>>> Why doesn't the FreeBSD version end up with consistent noncees. >>>> >>>> "Good thing" about the previous code was that I could tweak it, and at >>>> least get it to Work for FreeBSD. Have not had the time to see if I >>>> could again with this code.... >>> So the smallest sequence I can find that demonstrates the problem: >>> function test_mon_rebind() >>> { >>> ceph osd set noup >>> ceph osd down 0 >>> ceph osd dump | grep 'osd.0 down' >>> ceph osd unset noup >>> max_run=1000 >>> for ((i=0; i < $max_run; i++)); do >>> if ! ceph osd dump | grep 'osd.0 up'; then >>> echo "waiting for osd.0 to come back up ($i/$max_run)" >>> sleep 1 >>> else >>> break >>> fi >>> done >>> ceph osd dump | grep 'osd.0 up' >>> >>> for id in `ceph osd ls` ; do >>> retry_eagain 5 map_enxio_to_eagain ceph tell osd.$id version >>> done >>> } >>> >>> Which matches with what I thought I knew: >>> OSD down => up => rebind >>> which follows from the log where the osd complains about being marked >>> down incorrectly. >>> search for >>> log_channel(cluster) log [WRN] : map e8 wrongly marked me down >>> in the osd.0.log >>> >> >> I ran the code snippet, and it DOES generate a rebind even on Linux >> Centos 7. >> So I expect that the rebind action is called for? >> >> Also note that at least one nonce is incremented, AND new ports are >> selected. >> 6801,6802,6803 are avoided, and it looks like already connected ports >> are avoided as well. >> (obvious, since binding on them will not work) >> >> Next is the FreeBSD log, and that one is relatively seriously wrong >> since nonce are 0 . >> And that must be incorrect. >> >> Last is the log of the current master code with the patch I made of >> msg->my_addr, versus addr. >> IMHO a change that should be semantical equal? >> return bind(msgr->my_inst.addr, new_avoid); >> // return bind(addr, new_avoid); >> One thing that is not working there is the port avoidance. That I have >> to look into since there is no reason >> why this should not work under FreeBSD. > > I have tested with the new code Kefu submitted to master this afternoon: > The Linux variant gives: > 7f36b1dd1700 1 -- 127.0.0.1:6800/13799 <== mon.2 127.0.0.1:8204/0 38 > ==== osd_map(18..21 src has 1..21) v3 ==== 1012+0+0 (424389225 0 0) > 0x7f36cdcc7cc0 con 0x7f36cdb24000 > 7f36b5ac6700 0 log_channel(cluster) log [WRN] : map e21 wrongly marked > me down > 7f36b5ac6700 1 -- 127.0.0.1:6801/13799 rebind rebind avoid 6801,6802,6803 > 7f36b5ac6700 1 -- 127.0.0.1:6801/13799 shutdown_connections > 7f36b5ac6700 1 -- 127.0.0.1:6812/1013799 _finish_bind bind my_inst.addr > is 127.0.0.1:6812/1013799 > 7f36b5ac6700 1 Processor -- start > > Note that the new allocated channel is on port 6812 with an incremented > nonce, and the channel on port 6800 is not touched. Probably because it > is still open, and it is not the channel being rebound. > > Now the FreeBSD version: > bb99f80 1 -- 127.0.0.1:6800/6449 <== mon.1 127.0.0.1:7203/0 158 ==== > osd_map(172..175 src has 1..175) v3 ==== 1045+0+0 (1598708051 0 0) > 0xc08eb40 con 0xbbca000 > bb98d80 0 log_channel(cluster) log [WRN] : map e175 wrongly marked me down > bb98d80 1 -- 127.0.0.1:6801/6449 rebind rebind avoid 6801,6802,6803 > bb98d80 1 -- 127.0.0.1:6801/6449 shutdown_connections > bb98d80 1 -- 127.0.0.1:6800/1006449 _finish_bind bind my_inst.addr is > 127.0.0.1:6800/1006449 > bb98d80 1 Processor -- start > > And here the new channel is allocated on port 6800, as if that would be > available. Should it also not be marked as such in avoid_ports?? > But that port is still being used as is indicated in the first line. > FreeBSD does not seem to have a problem binding to that port ?? > And thus a channel is listening on 6800, but with a new nonce. > So that could be the reason that ceph and rados are complaining that > they cannot talk to osd.0 > > So why is 6800 not in the list for channels to avoid? I patched osd/OSD.cc: (added to fit with 80 chars) avoid_ports.insert(hb_front_server_messenger->get_myaddr().get_port()); And just adding this extra port to avoid gets things done. I guess that there are enough remarks to be made on this patch?? --WjW --- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html --- a/src/osd/OSD.cc +++ b/src/osd/OSD.cc @@ -7126,6 +7126,7 @@ void OSD::_committed_osd_maps(epoch_t first, epoch_t last, MOSDMap *m) start_waiting_for_healthy(); set avoid_ports; + avoid_ports.insert(client_messenger->get_myaddr().get_port()); avoid_ports.insert(cluster_messenger->get_myaddr().get_port()); avoid_ports.insert(hb_back_server_messenger->get_myaddr().get_port());