From patchwork Mon May 6 05:29:32 2024
X-Patchwork-Submitter: Mingwei Zhang
X-Patchwork-Id: 13654915
Reply-To: Mingwei Zhang
Date: Mon, 6 May 2024 05:29:32 +0000
In-Reply-To: <20240506053020.3911940-1-mizhang@google.com>
References: <20240506053020.3911940-1-mizhang@google.com>
Message-ID: <20240506053020.3911940-8-mizhang@google.com>
X-Mailing-List: kvm@vger.kernel.org
Subject: [PATCH v2 07/54] perf: Add generic exclude_guest support
From: Mingwei Zhang
To: Sean Christopherson, Paolo Bonzini, Xiong Zhang, Dapeng Mi,
    Kan Liang, Zhenyu Wang, Manali Shukla, Sandipan Das
Cc: Jim Mattson, Stephane Eranian, Ian Rogers, Namhyung Kim,
    Mingwei Zhang, gce-passthrou-pmu-dev@google.com, Samantha Alt,
    Zhiyuan Lv, Yanfei Xu, maobibo, Like Xu, Peter Zijlstra,
    kvm@vger.kernel.org, linux-perf-users@vger.kernel.org

From: Kan Liang

Currently, perf does not explicitly schedule out exclude_guest events
while a guest is running. That is not a problem for the existing
emulated vPMU, because perf owns all the PMU counters: it can mask a
counter assigned to an exclude_guest event while a guest is running
(the Intel way), or set the corresponding HOSTONLY bit in the eventsel
(the AMD way), so the counter simply does not count while the guest
runs.

However, neither approach works with the newly introduced passthrough
vPMU. The guest owns all the PMU counters while it is running, so the
host must not mask any counter: the counter may be in use by the guest,
and the eventsel may be overwritten. Instead, perf should explicitly
schedule out all exclude_guest events to release the PMU resources when
entering a guest, and resume counting when exiting the guest.

Expose two interfaces to KVM, which should call them to notify perf
when entering and exiting a guest.

An exclude_guest event may also be created while a guest is running;
such a new event must not be scheduled in either.
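For illustration, a KVM-side caller is expected to bracket the guest run
loop with the two interfaces, with IRQs disabled, roughly as sketched
below. This sketch is not part of this patch: the function and helper
names (kvm_x86_load_guest_pmu_context(), kvm_mediated_pmu_enabled()) are
placeholders, and the actual KVM hooks are presumably added by later
patches in the series.

    /* Hypothetical KVM-side usage sketch (names are placeholders). */
    static void kvm_x86_load_guest_pmu_context(struct kvm_vcpu *vcpu)
    {
            lockdep_assert_irqs_disabled();

            if (kvm_mediated_pmu_enabled(vcpu))
                    /* Schedule out host exclude_guest events, free the PMU. */
                    perf_guest_enter();
    }

    static void kvm_x86_put_guest_pmu_context(struct kvm_vcpu *vcpu)
    {
            lockdep_assert_irqs_disabled();

            if (kvm_mediated_pmu_enabled(vcpu))
                    /* Resume host exclude_guest counting. */
                    perf_guest_exit();
    }

Both calls must be made on the same CPU with interrupts disabled, which
matches the lockdep_assert_irqs_disabled() checks in perf_guest_enter()
and perf_guest_exit() below.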
Signed-off-by: Kan Liang
---
 include/linux/perf_event.h |   4 ++
 kernel/events/core.c       | 104 +++++++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index dd4920bf3d1b..acf16676401a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1734,6 +1734,8 @@ extern int perf_event_period(struct perf_event *event, u64 value);
 extern u64 perf_event_pause(struct perf_event *event, bool reset);
 extern int perf_get_mediated_pmu(void);
 extern void perf_put_mediated_pmu(void);
+void perf_guest_enter(void);
+void perf_guest_exit(void);
 #else /* !CONFIG_PERF_EVENTS: */
 static inline void *
 perf_aux_output_begin(struct perf_output_handle *handle,
@@ -1826,6 +1828,8 @@ static inline int perf_get_mediated_pmu(void)
 }
 
 static inline void perf_put_mediated_pmu(void)        { }
+static inline void perf_guest_enter(void)             { }
+static inline void perf_guest_exit(void)              { }
 #endif
 
 #if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 701b622c670e..4c6daf5cc923 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -406,6 +406,7 @@ static atomic_t nr_include_guest_events __read_mostly;
 
 static refcount_t nr_mediated_pmu_vms = REFCOUNT_INIT(0);
 static DEFINE_MUTEX(perf_mediated_pmu_mutex);
+static DEFINE_PER_CPU(bool, perf_in_guest);
 
 /* !exclude_guest system wide event of PMU with PERF_PMU_CAP_PASSTHROUGH_VPMU */
 static inline bool is_include_guest_event(struct perf_event *event)
@@ -3854,6 +3855,15 @@ static int merge_sched_in(struct perf_event *event, void *data)
         if (!event_filter_match(event))
                 return 0;
 
+        /*
+         * Don't schedule in any exclude_guest events of PMU with
+         * PERF_PMU_CAP_PASSTHROUGH_VPMU, while a guest is running.
+         */
+        if (__this_cpu_read(perf_in_guest) &&
+            event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU &&
+            event->attr.exclude_guest)
+                return 0;
+
         if (group_can_go_on(event, *can_add_hw)) {
                 if (!group_sched_in(event, ctx))
                         list_add_tail(&event->active_list, get_event_list(event));
@@ -5791,6 +5801,100 @@ void perf_put_mediated_pmu(void)
 }
 EXPORT_SYMBOL_GPL(perf_put_mediated_pmu);
 
+static void perf_sched_out_exclude_guest(struct perf_event_context *ctx)
+{
+        struct perf_event_pmu_context *pmu_ctx;
+
+        update_context_time(ctx);
+        list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+                struct perf_event *event, *tmp;
+                struct pmu *pmu = pmu_ctx->pmu;
+
+                if (!(pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
+                        continue;
+
+                perf_pmu_disable(pmu);
+
+                /*
+                 * All active events must be exclude_guest events.
+                 * See perf_get_mediated_pmu().
+                 * Unconditionally remove all active events.
+                 */
+                list_for_each_entry_safe(event, tmp, &pmu_ctx->pinned_active, active_list)
+                        group_sched_out(event, pmu_ctx->ctx);
+
+                list_for_each_entry_safe(event, tmp, &pmu_ctx->flexible_active, active_list)
+                        group_sched_out(event, pmu_ctx->ctx);
+
+                pmu_ctx->rotate_necessary = 0;
+
+                perf_pmu_enable(pmu);
+        }
+}
+
+/* When entering a guest, schedule out all exclude_guest events. */
+void perf_guest_enter(void)
+{
+        struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+
+        lockdep_assert_irqs_disabled();
+
+        perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+        if (WARN_ON_ONCE(__this_cpu_read(perf_in_guest))) {
+                perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+                return;
+        }
+
+        perf_sched_out_exclude_guest(&cpuctx->ctx);
+        if (cpuctx->task_ctx)
+                perf_sched_out_exclude_guest(cpuctx->task_ctx);
+
+        __this_cpu_write(perf_in_guest, true);
+
+        perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+}
+
+static void perf_sched_in_exclude_guest(struct perf_event_context *ctx)
+{
+        struct perf_event_pmu_context *pmu_ctx;
+
+        update_context_time(ctx);
+        list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+                struct pmu *pmu = pmu_ctx->pmu;
+
+                if (!(pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU))
+                        continue;
+
+                perf_pmu_disable(pmu);
+                pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu);
+                pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu);
+                perf_pmu_enable(pmu);
+        }
+}
+
+void perf_guest_exit(void)
+{
+        struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
+
+        lockdep_assert_irqs_disabled();
+
+        perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+        if (WARN_ON_ONCE(!__this_cpu_read(perf_in_guest))) {
+                perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+                return;
+        }
+
+        __this_cpu_write(perf_in_guest, false);
+
+        perf_sched_in_exclude_guest(&cpuctx->ctx);
+        if (cpuctx->task_ctx)
+                perf_sched_in_exclude_guest(cpuctx->task_ctx);
+
+        perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+}
+
 /*
  * Holding the top-level event's child_mutex means that any
  * descendant process that has inherited this event will block