From patchwork Thu Dec 3 22:02:42 2020
X-Patchwork-Submitter: Mike Kravetz
X-Patchwork-Id: 11949729
From: Mike Kravetz
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Cc: Mina Almasry, Shakeel Butt, David Rientjes, Greg Thelen, Sandipan Das,
 Shuah Khan, Adrian Moreno, Andrew Morton, Mike Kravetz, stable@vger.kernel.org
Subject: [PATCH] hugetlb_cgroup: fix offline of hugetlb cgroup with reservations
Date: Thu, 3 Dec 2020 14:02:42 -0800
Message-Id: <20201203220242.158165-1-mike.kravetz@oracle.com>
X-Mailer: git-send-email 2.28.0

Adrian Moreno was running a kubernetes 1.19 + containerd/docker workload
using hugetlbfs.  In this environment the issue is reproduced by:

1 - Start a simple pod that uses the recently added HugePages medium
    feature (pod yaml attached)

2 - Start a DPDK app.  It doesn't need to run successfully (as in transfer
    packets) nor interact with real hardware.  It seems just initializing
    the EAL layer (which handles hugepage reservation and locking) is
    enough to trigger the issue.

3 - Delete the Pod (or let it "Complete").

This would result in a kworker thread going into a tight loop (top output):

 1425 root      20   0       0      0      0 R  99.7   0.0   5:22.45 kworker/28:7+cgroup_destroy

'perf top -g' reports:

-   63.28%     0.01%  [kernel]                    [k] worker_thread
   - 49.97% worker_thread
      - 52.64% process_one_work
         - 62.08% css_killed_work_fn
            - hugetlb_cgroup_css_offline
                 41.52% _raw_spin_lock
               - 2.82% _cond_resched
                    rcu_all_qs
                 2.66% PageHuge
               - 0.57% schedule
                  - 0.57% __schedule

We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
infinitely spinning.  Little else can be done on the system as the
cgroup_mutex can not be acquired.

Do note that the issue can be reproduced by simply offlining a hugetlb
cgroup containing pages with reservation counts.
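For illustration only, the sketch below (not part of Adrian's reproduction
steps or this patch; the huge page size and the counter names mentioned in
the comments are assumptions about a typical x86_64 / cgroup v1 setup) shows
one way to leave a reservation count charged to a hugetlb cgroup: run it in a
child hugetlb cgroup, migrate the task out (hugetlb cgroup charges are not
transferred on migration), and rmdir the now-empty cgroup.

/*
 * Illustrative sketch: create a private hugetlb mapping so the hugetlb
 * cgroup this task runs in is charged a reservation.  Assumes huge pages
 * have been preallocated (e.g. via /proc/sys/vm/nr_hugepages).
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_LEN (2UL * 1024 * 1024)	/* one 2 MB huge page */

int main(void)
{
	/*
	 * MAP_HUGETLB without MAP_NORESERVE reserves the huge page up
	 * front; the reservation is what shows up in the owning cgroup's
	 * reservation counters (e.g. hugetlb.2MB.rsvd.usage_in_bytes).
	 */
	void *p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}
	memset(p, 0, MAP_LEN);	/* optionally fault the page in as well */

	pause();		/* hold the mapping while the cgroup is offlined */
	return 0;
}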
The loop in hugetlb_cgroup_css_offline is moving page counts from the
cgroup being offlined to the parent cgroup.  This is done for each
hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
The routine moving counts (hugetlb_cgroup_move_parent) is only moving
'usage' counts.  The routine hugetlb_cgroup_have_usage is checking for
both 'usage' and 'reservation' counts.  What to do with reservation
counts when reparenting was discussed here:

https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/

The decision was made to leave a zombie cgroup with the reservation
counts.  Unfortunately, the code checking reservation counts was
incorrectly added to hugetlb_cgroup_have_usage.  To fix the issue,
simply remove the check for reservation counts.

While fixing this issue, a related bug in hugetlb_cgroup_css_offline
was noticed.  The hstate index is not reinitialized each time through
the do-while loop.  Fix this as well.

Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
Cc: stable@vger.kernel.org
Reported-by: Adrian Moreno
Tested-by: Adrian Moreno
Signed-off-by: Mike Kravetz
Reviewed-by: Shakeel Butt
---
 mm/hugetlb_cgroup.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 1f87aec9ab5c..9182848dda3e 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -82,11 +82,8 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
 
 	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
 		if (page_counter_read(
-			    hugetlb_cgroup_counter_from_cgroup(h_cg, idx)) ||
-		    page_counter_read(hugetlb_cgroup_counter_from_cgroup_rsvd(
-			    h_cg, idx))) {
+			    hugetlb_cgroup_counter_from_cgroup(h_cg, idx)))
 			return true;
-		}
 	}
 	return false;
 }
@@ -202,9 +199,10 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
 	struct hstate *h;
 	struct page *page;
-	int idx = 0;
+	int idx;
 
 	do {
+		idx = 0;
 		for_each_hstate(h) {
 			spin_lock(&hugetlb_lock);
 			list_for_each_entry(page, &h->hugepage_activelist, lru)
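To make the termination argument concrete, here is a small userspace model
(an illustration only, not kernel code; MAX_HSTATE, the counter values and
the helper names are made up) of the pre-fix behaviour: the exit condition
keeps seeing the leftover reservation count that the move-to-parent step
never drains, so the do-while loop can never finish.

/* Userspace model of the pre-fix offline loop; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_HSTATE 2				/* made-up number of hstates */

static long usage[MAX_HSTATE] = { 3, 0 };	/* pages charged to the cgroup */
static long rsvd[MAX_HSTATE]  = { 1, 0 };	/* reservations charged to it */

/* Pre-fix hugetlb_cgroup_have_usage(): true while EITHER counter is set. */
static bool have_usage_before_fix(void)
{
	for (int idx = 0; idx < MAX_HSTATE; idx++)
		if (usage[idx] || rsvd[idx])
			return true;
	return false;
}

/* Mirrors hugetlb_cgroup_move_parent(): only usage counts are reparented. */
static void move_parent(int idx)
{
	usage[idx] = 0;
}

int main(void)
{
	int passes = 0;

	do {
		for (int idx = 0; idx < MAX_HSTATE; idx++)
			move_parent(idx);
		if (++passes > 5) {	/* the real loop has no such escape hatch */
			puts("rsvd count never drained -> loop would spin forever");
			return 0;
		}
	} while (have_usage_before_fix());

	printf("terminated after %d pass(es)\n", passes);
	return 0;
}

Dropping the reservation check from the exit condition, as the first hunk
above does, lets the modeled loop exit after a single pass; reinitializing
idx inside the do-while (second hunk) likewise keeps the per-hstate index
from running past hugetlb_max_hstate on later passes.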