Message ID | 20201203220242.158165-1-mike.kravetz@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Series | hugetlb_cgroup: fix offline of hugetlb cgroup with reservations |
On Thu, Dec 3, 2020 at 2:04 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> Adrian Moreno was running a kubernetes 1.19 + containerd/docker workload
> using hugetlbfs. In this environment the issue is reproduced by:
> 1 - Start a simple pod that uses the recently added HugePages medium
>     feature (pod yaml attached)
> 2 - Start a DPDK app. It doesn't need to run successfully (as in transfer
>     packets) nor interact with real hardware. It seems just initializing
>     the EAL layer (which handles hugepage reservation and locking) is
>     enough to trigger the issue
> 3 - Delete the Pod (or let it "Complete").
>
> This would result in a kworker thread going into a tight loop (top output):
> 1425 root      20   0       0      0      0 R  99.7   0.0   5:22.45
>      kworker/28:7+cgroup_destroy
>
> 'perf top -g' reports:
> - 63.28%     0.01%  [kernel]        [k] worker_thread
>    - 49.97% worker_thread
>       - 52.64% process_one_work
>          - 62.08% css_killed_work_fn
>             - hugetlb_cgroup_css_offline
>                  41.52% _raw_spin_lock
>                - 2.82% _cond_resched
>                     rcu_all_qs
>                  2.66% PageHuge
>       - 0.57% schedule
>          - 0.57% __schedule
>
> We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
> Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
> infinitely spinning. Little else can be done on the system as the
> cgroup_mutex can not be acquired.
>
> Do note that the issue can be reproduced by simply offlining a hugetlb
> cgroup containing pages with reservation counts.
>
> The loop in hugetlb_cgroup_css_offline is moving page counts from the
> cgroup being offlined to the parent cgroup. This is done for each hstate,
> and is repeated until hugetlb_cgroup_have_usage returns false. The routine
> moving counts (hugetlb_cgroup_move_parent) is only moving 'usage' counts.
> The routine hugetlb_cgroup_have_usage is checking for both 'usage' and
> 'reservation' counts. What to do with reservation counts when reparenting
> was discussed here:
>
> https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/
>
> The decision was made to leave a zombie cgroup when reservation counts
> remain. Unfortunately, the code checking reservation counts was
> incorrectly added to hugetlb_cgroup_have_usage.
>
> To fix the issue, simply remove the check for reservation counts. While
> fixing this issue, a related bug in hugetlb_cgroup_css_offline was noticed.
> The hstate index is not reinitialized each time through the do-while loop.
> Fix this as well.
>
> Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
> Cc: <stable@vger.kernel.org>
> Reported-by: Adrian Moreno <amorenoz@redhat.com>
> Tested-by: Adrian Moreno <amorenoz@redhat.com>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>
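For readers following the description above without the kernel source handy, here is a minimal userspace sketch of the termination mismatch. The struct and helpers (fake_hugetlb_cgroup, move_parent, have_usage_buggy) are simplified illustrative stand-ins, not the kernel's actual code:

#include <stdbool.h>
#include <stdio.h>

#define MAX_HSTATE 2

/* Simplified stand-in for the per-hstate page counters of a hugetlb cgroup. */
struct fake_hugetlb_cgroup {
	long usage[MAX_HSTATE];	/* charged hugetlb pages */
	long rsvd[MAX_HSTATE];	/* charged reservations  */
};

/* Models hugetlb_cgroup_move_parent(): only 'usage' is reparented. */
static void move_parent(struct fake_hugetlb_cgroup *cg, int idx)
{
	cg->usage[idx] = 0;
}

/* Models the pre-patch hugetlb_cgroup_have_usage(): checks usage AND rsvd. */
static bool have_usage_buggy(struct fake_hugetlb_cgroup *cg)
{
	for (int idx = 0; idx < MAX_HSTATE; idx++)
		if (cg->usage[idx] || cg->rsvd[idx])
			return true;
	return false;
}

int main(void)
{
	struct fake_hugetlb_cgroup cg = { .usage = { 1, 0 }, .rsvd = { 1, 0 } };
	long passes = 0;

	/* Models the do-while in hugetlb_cgroup_css_offline(). */
	do {
		for (int idx = 0; idx < MAX_HSTATE; idx++)
			move_parent(&cg, idx);
		if (++passes > 3) {	/* the real loop has no such bail-out */
			printf("still spinning after %ld passes: rsvd never drops\n",
			       passes);
			return 1;
		}
	} while (have_usage_buggy(&cg));

	return 0;
}

Because move_parent() never touches rsvd, have_usage_buggy() keeps returning true and the do-while never exits; dropping the rsvd check, as the patch below does, is what lets the loop terminate.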
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 1f87aec9ab5c..9182848dda3e 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -82,11 +82,8 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
 
 	for (idx = 0; idx < hugetlb_max_hstate; idx++) {
 		if (page_counter_read(
-			    hugetlb_cgroup_counter_from_cgroup(h_cg, idx)) ||
-		    page_counter_read(hugetlb_cgroup_counter_from_cgroup_rsvd(
-			    h_cg, idx))) {
+			    hugetlb_cgroup_counter_from_cgroup(h_cg, idx)))
 			return true;
-		}
 	}
 	return false;
 }
@@ -202,9 +199,10 @@ static void hugetlb_cgroup_css_offline(struct cgroup_subsys_state *css)
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
 	struct hstate *h;
 	struct page *page;
-	int idx = 0;
+	int idx;
 
 	do {
+		idx = 0;
 		for_each_hstate(h) {
 			spin_lock(&hugetlb_lock);
 			list_for_each_entry(page, &h->hugepage_activelist, lru)
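As a rough illustration of the simpler reproducer mentioned in the changelog (offlining a hugetlb cgroup that holds only reservations), the sketch below creates a hugetlb reservation without faulting any pages. It assumes 2MB hugepages are configured and that the process has already been placed in a hugetlb cgroup with CONFIG_CGROUP_HUGETLB rsvd charging; the cgroup would then be offlined (rmdir) from another shell while this program waits. This is an illustrative sketch, not the pod/DPDK setup from the report:

/* Hold a hugetlb *reservation* (no faulted pages), then offline/rmdir the
 * hugetlb cgroup from another shell while this program is paused.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL * 1024 * 1024;	/* one 2MB hugepage */

	/* MAP_HUGETLB reserves the hugepage at mmap time; since the mapping
	 * is never touched, no page is faulted in. */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	printf("holding hugetlb reservation; offline the cgroup now\n");
	pause();	/* keep the reservation alive until interrupted */
	return 0;
}

Since the mapping is never touched, the cgroup's usage counter stays at zero while its reservation counter is non-zero, which is exactly the state in which the pre-patch hugetlb_cgroup_have_usage never returns false.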