diff mbox series

[08/27] habanalabs: add info when FD released while device still in use

Message ID 20230212204454.2938561-8-ogabbay@kernel.org (mailing list archive)
State New, archived
Headers show
Series [01/27] habanalabs/gaudi2: increase user interrupt grace time | expand

Commit Message

Oded Gabbay Feb. 12, 2023, 8:44 p.m. UTC
From: Tomer Tayar <ttayar@habana.ai>

When user closes the device file descriptor, it is checked whether the
device is still in use, and a message is printed if it is.
To make this message more informative, add to this print also the reason
due to which the device is considered as in use.
The possible reasons which are checked for now are active CS and
exported dma-buf.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
---
 .../habanalabs/common/command_submission.c    | 16 ++++++
 drivers/accel/habanalabs/common/device.c      | 51 +++++++++++++++++--
 drivers/accel/habanalabs/common/habanalabs.h  |  4 +-
 drivers/accel/habanalabs/common/memory.c      |  2 +
 4 files changed, 69 insertions(+), 4 deletions(-)

Comments

Stanislaw Gruszka Feb. 16, 2023, 12:25 p.m. UTC | #1
On Sun, Feb 12, 2023 at 10:44:35PM +0200, Oded Gabbay wrote:
> From: Tomer Tayar <ttayar@habana.ai>
> 
> When user closes the device file descriptor, it is checked whether the
> device is still in use, and a message is printed if it is.

I guess this is only for debugging your user-space component?
Because kernel driver should not make any assumption about
user-space behavior. Closing whenever user wants should be
no problem.

> +static void print_device_in_use_info(struct hl_device *hdev, const char *message)
> +{
> +	u32 active_cs_num, dmabuf_export_cnt;
> +	char buf[64], *buf_ptr = buf;
> +	size_t buf_size = sizeof(buf);
> +	bool unknown_reason = true;
> +
> +	active_cs_num = hl_get_active_cs_num(hdev);
> +	if (active_cs_num) {
> +		unknown_reason = false;
> +		compose_device_in_use_info(&buf_ptr, &buf_size, " [%u active CS]", active_cs_num);
> +	}
> +
> +	dmabuf_export_cnt = atomic_read(&hdev->dmabuf_export_cnt);
> +	if (dmabuf_export_cnt) {
> +		unknown_reason = false;
> +		compose_device_in_use_info(&buf_ptr, &buf_size, " [%u exported dma-buf]",
> +						dmabuf_export_cnt);
> +	}
> +
> +	if (unknown_reason)
> +		compose_device_in_use_info(&buf_ptr, &buf_size, " [unknown reason]");
> +
> +	dev_notice(hdev->dev, "%s%s\n", message, buf);

why not print counters directly, i.e. "active cs count %u, dmabuf export count %u" ?

>  	if (!hl_hpriv_put(hpriv)) {
> -		dev_notice(hdev->dev, "User process closed FD but device still in use\n");
> +		print_device_in_use_info(hdev, "User process closed FD but device still in use");
>  		hl_device_reset(hdev, HL_DRV_RESET_HARD);

You really need reset here ? 

Regards
Stanislaw
Oded Gabbay Feb. 16, 2023, 2:21 p.m. UTC | #2
On Thu, Feb 16, 2023 at 2:25 PM Stanislaw Gruszka
<stanislaw.gruszka@linux.intel.com> wrote:
>
> On Sun, Feb 12, 2023 at 10:44:35PM +0200, Oded Gabbay wrote:
> > From: Tomer Tayar <ttayar@habana.ai>
> >
> > When user closes the device file descriptor, it is checked whether the
> > device is still in use, and a message is printed if it is.
>
> I guess this is only for debugging your user-space component?
> Because kernel driver should not make any assumption about
> user-space behavior. Closing whenever user wants should be
> no problem.
First of all, indeed the user can close the device whatever it wants.
We don't limit him, but we do need to track the device state, because
we can't allow a new user to acquire the device until it is idle (due
to h/w limitations).
Therefore, this print is not so much for debug, as it is for letting
the user know the device wasn't idle after he closed it, and
therefore, we are going to reset it to make it idle.
So, it is a notification that is important imo.

>
> > +static void print_device_in_use_info(struct hl_device *hdev, const char *message)
> > +{
> > +     u32 active_cs_num, dmabuf_export_cnt;
> > +     char buf[64], *buf_ptr = buf;
> > +     size_t buf_size = sizeof(buf);
> > +     bool unknown_reason = true;
> > +
> > +     active_cs_num = hl_get_active_cs_num(hdev);
> > +     if (active_cs_num) {
> > +             unknown_reason = false;
> > +             compose_device_in_use_info(&buf_ptr, &buf_size, " [%u active CS]", active_cs_num);
> > +     }
> > +
> > +     dmabuf_export_cnt = atomic_read(&hdev->dmabuf_export_cnt);
> > +     if (dmabuf_export_cnt) {
> > +             unknown_reason = false;
> > +             compose_device_in_use_info(&buf_ptr, &buf_size, " [%u exported dma-buf]",
> > +                                             dmabuf_export_cnt);
> > +     }
> > +
> > +     if (unknown_reason)
> > +             compose_device_in_use_info(&buf_ptr, &buf_size, " [unknown reason]");
> > +
> > +     dev_notice(hdev->dev, "%s%s\n", message, buf);
>
> why not print counters directly, i.e. "active cs count %u, dmabuf export count %u" ?
Because we wanted to print the specific reason, or unknown reason, and
not print all the possible counters in one line, because most of the
time most of the counters will be 0.
We plan to add more reasons so this helper simplifies the code.

>
> >       if (!hl_hpriv_put(hpriv)) {
> > -             dev_notice(hdev->dev, "User process closed FD but device still in use\n");
> > +             print_device_in_use_info(hdev, "User process closed FD but device still in use");
> >               hl_device_reset(hdev, HL_DRV_RESET_HARD);
>
> You really need reset here ?
Yes, our h/w requires that we reset the device after the user closed
it. If the device is not idle after the user closed it, we hard reset
it.
If it is idle, we do a more graceful reset.
Thanks,
Oded
>
> Regards
> Stanislaw
Stanislaw Gruszka Feb. 16, 2023, 3:04 p.m. UTC | #3
On Thu, Feb 16, 2023 at 04:21:48PM +0200, Oded Gabbay wrote:
> On Thu, Feb 16, 2023 at 2:25 PM Stanislaw Gruszka
> <stanislaw.gruszka@linux.intel.com> wrote:
> >
> > On Sun, Feb 12, 2023 at 10:44:35PM +0200, Oded Gabbay wrote:
> > > From: Tomer Tayar <ttayar@habana.ai>
> > >
> > > When user closes the device file descriptor, it is checked whether the
> > > device is still in use, and a message is printed if it is.
> >
> > I guess this is only for debugging your user-space component?
> > Because kernel driver should not make any assumption about
> > user-space behavior. Closing whenever user wants should be
> > no problem.
> First of all, indeed the user can close the device whatever it wants.
> We don't limit him, but we do need to track the device state, because
> we can't allow a new user to acquire the device until it is idle (due
> to h/w limitations).
> Therefore, this print is not so much for debug, as it is for letting
> the user know the device wasn't idle after he closed it, and
> therefore, we are going to reset it to make it idle.
> So, it is a notification that is important imo.

This sounds like something that should be handed in open() with -EAGAIN
error with eventual message in dmesg. But you know best what info
is needed by user-space :-)

> > > +static void print_device_in_use_info(struct hl_device *hdev, const char *message)
> > > +{
> > > +     u32 active_cs_num, dmabuf_export_cnt;
> > > +     char buf[64], *buf_ptr = buf;
> > > +     size_t buf_size = sizeof(buf);
> > > +     bool unknown_reason = true;
> > > +
> > > +     active_cs_num = hl_get_active_cs_num(hdev);
> > > +     if (active_cs_num) {
> > > +             unknown_reason = false;
> > > +             compose_device_in_use_info(&buf_ptr, &buf_size, " [%u active CS]", active_cs_num);
> > > +     }
> > > +
> > > +     dmabuf_export_cnt = atomic_read(&hdev->dmabuf_export_cnt);
> > > +     if (dmabuf_export_cnt) {
> > > +             unknown_reason = false;
> > > +             compose_device_in_use_info(&buf_ptr, &buf_size, " [%u exported dma-buf]",
> > > +                                             dmabuf_export_cnt);
> > > +     }
> > > +
> > > +     if (unknown_reason)
> > > +             compose_device_in_use_info(&buf_ptr, &buf_size, " [unknown reason]");
> > > +
> > > +     dev_notice(hdev->dev, "%s%s\n", message, buf);
> >
> > why not print counters directly, i.e. "active cs count %u, dmabuf export count %u" ?
> Because we wanted to print the specific reason, or unknown reason, and
> not print all the possible counters in one line, because most of the
> time most of the counters will be 0.
> We plan to add more reasons so this helper simplifies the code.

Ok, just place replace compose_device_in_use_info() with snprintf().
I don't think you need custom implementation of snprintf().

> > > +             print_device_in_use_info(hdev, "User process closed FD but device still in use");
> > >               hl_device_reset(hdev, HL_DRV_RESET_HARD);
> >
> > You really need reset here ?
> Yes, our h/w requires that we reset the device after the user closed
> it. If the device is not idle after the user closed it, we hard reset
> it.
> If it is idle, we do a more graceful reset.

Hmm, ok.

Regards
Stanislaw
Tomer Tayar Feb. 17, 2023, 11:34 a.m. UTC | #4
On Thu, Feb 16, 2023 at 17:05 Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> wrote:
> On Thu, Feb 16, 2023 at 04:21:48PM +0200, Oded Gabbay wrote:
> > On Thu, Feb 16, 2023 at 2:25 PM Stanislaw Gruszka
> > <stanislaw.gruszka@linux.intel.com> wrote:
> > >
> > > On Sun, Feb 12, 2023 at 10:44:35PM +0200, Oded Gabbay wrote:
> > > > From: Tomer Tayar <ttayar@habana.ai>
> > > >
> > > > When user closes the device file descriptor, it is checked whether the
> > > > device is still in use, and a message is printed if it is.
> > >
> > > I guess this is only for debugging your user-space component?
> > > Because kernel driver should not make any assumption about
> > > user-space behavior. Closing whenever user wants should be
> > > no problem.
> > First of all, indeed the user can close the device whatever it wants.
> > We don't limit him, but we do need to track the device state, because
> > we can't allow a new user to acquire the device until it is idle (due
> > to h/w limitations).
> > Therefore, this print is not so much for debug, as it is for letting
> > the user know the device wasn't idle after he closed it, and
> > therefore, we are going to reset it to make it idle.
> > So, it is a notification that is important imo.
> 
> This sounds like something that should be handed in open() with -EAGAIN
> error with eventual message in dmesg. But you know best what info
> is needed by user-space :-)

Because of the reset in this case and the involved cleanup, this info won't be available in next open().

> > > > +static void print_device_in_use_info(struct hl_device *hdev, const char
> *message)
> > > > +{
> > > > +     u32 active_cs_num, dmabuf_export_cnt;
> > > > +     char buf[64], *buf_ptr = buf;
> > > > +     size_t buf_size = sizeof(buf);
> > > > +     bool unknown_reason = true;
> > > > +
> > > > +     active_cs_num = hl_get_active_cs_num(hdev);
> > > > +     if (active_cs_num) {
> > > > +             unknown_reason = false;
> > > > +             compose_device_in_use_info(&buf_ptr, &buf_size, " [%u active
> CS]", active_cs_num);
> > > > +     }
> > > > +
> > > > +     dmabuf_export_cnt = atomic_read(&hdev->dmabuf_export_cnt);
> > > > +     if (dmabuf_export_cnt) {
> > > > +             unknown_reason = false;
> > > > +             compose_device_in_use_info(&buf_ptr, &buf_size, " [%u
> exported dma-buf]",
> > > > +                                             dmabuf_export_cnt);
> > > > +     }
> > > > +
> > > > +     if (unknown_reason)
> > > > +             compose_device_in_use_info(&buf_ptr, &buf_size, " [unknown
> reason]");
> > > > +
> > > > +     dev_notice(hdev->dev, "%s%s\n", message, buf);
> > >
> > > why not print counters directly, i.e. "active cs count %u, dmabuf export
> count %u" ?
> > Because we wanted to print the specific reason, or unknown reason, and
> > not print all the possible counters in one line, because most of the
> > time most of the counters will be 0.
> > We plan to add more reasons so this helper simplifies the code.
> 
> Ok, just place replace compose_device_in_use_info() with snprintf().
> I don't think you need custom implementation of snprintf().

compose_device_in_use_info() was added to handle in a single place the snprintf() return value and the buffer pointer moving.
However, you are correct and it is too much here, as the local buffer size is set with a value that should be enough for max possible print.
We will remove compose_device_in_use_info() and use snprintf() directly.
Thanks!

> > > > +             print_device_in_use_info(hdev, "User process closed FD but
> device still in use");
> > > >               hl_device_reset(hdev, HL_DRV_RESET_HARD);
> > >
> > > You really need reset here ?
> > Yes, our h/w requires that we reset the device after the user closed
> > it. If the device is not idle after the user closed it, we hard reset
> > it.
> > If it is idle, we do a more graceful reset.
> 
> Hmm, ok.
> 
> Regards
> Stanislaw
Stanislaw Gruszka Feb. 20, 2023, 3:54 p.m. UTC | #5
On Fri, Feb 17, 2023 at 11:34:39AM +0000, Tomer Tayar wrote:
 > 
> > Ok, just place replace compose_device_in_use_info() with snprintf().
> > I don't think you need custom implementation of snprintf().
> 
> compose_device_in_use_info() was added to handle in a single place the snprintf() return value and the buffer pointer moving.
> However, you are correct and it is too much here, as the local buffer size is set with a value that should be enough for max possible print.
> We will remove compose_device_in_use_info() and use snprintf() directly.

Actually the safer version would be scnprintf() since for that function
return value could not be bigger than passed len. Usage then could be
as simple as:

n += scnprintf(buf + n, len - n, ...);
n += scnprintf(buf + n, len - n, ...);

Regards
Stanislaw
Tomer Tayar Feb. 20, 2023, 4:16 p.m. UTC | #6
On Thu, Feb 20, 2023 at 17:55 Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> wrote:
> On Fri, Feb 17, 2023 at 11:34:39AM +0000, Tomer Tayar wrote:
>  >
> > > Ok, just place replace compose_device_in_use_info() with snprintf().
> > > I don't think you need custom implementation of snprintf().
> >
> > compose_device_in_use_info() was added to handle in a single place the
> snprintf() return value and the buffer pointer moving.
> > However, you are correct and it is too much here, as the local buffer size is set
> with a value that should be enough for max possible print.
> > We will remove compose_device_in_use_info() and use snprintf() directly.
> 
> Actually the safer version would be scnprintf() since for that function
> return value could not be bigger than passed len. Usage then could be
> as simple as:
> 
> n += scnprintf(buf + n, len - n, ...);
> n += scnprintf(buf + n, len - n, ...);
> 
> Regards
> Stanislaw

Sure, we will use it, thanks!
diff mbox series

Patch

diff --git a/drivers/accel/habanalabs/common/command_submission.c b/drivers/accel/habanalabs/common/command_submission.c
index e313ff8af7cc..24f3d82ee4cb 100644
--- a/drivers/accel/habanalabs/common/command_submission.c
+++ b/drivers/accel/habanalabs/common/command_submission.c
@@ -1168,6 +1168,22 @@  static void cs_completion(struct work_struct *work)
 		hl_complete_job(hdev, job);
 }
 
+u32 hl_get_active_cs_num(struct hl_device *hdev)
+{
+	u32 active_cs_num = 0;
+	struct hl_cs *cs;
+
+	spin_lock(&hdev->cs_mirror_lock);
+
+	list_for_each_entry(cs, &hdev->cs_mirror_list, mirror_node)
+		if (!cs->completed)
+			active_cs_num++;
+
+	spin_unlock(&hdev->cs_mirror_lock);
+
+	return active_cs_num;
+}
+
 static int validate_queue_index(struct hl_device *hdev,
 				struct hl_cs_chunk *chunk,
 				enum hl_queue_type *queue_type,
diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
index 194c282d7e55..b8c74185eabd 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -492,6 +492,52 @@  int hl_hpriv_put(struct hl_fpriv *hpriv)
 	return kref_put(&hpriv->refcount, hpriv_release);
 }
 
+static void compose_device_in_use_info(char **buf, size_t *buf_size, const char *fmt, ...)
+{
+	struct va_format vaf;
+	va_list args;
+	int size;
+
+	va_start(args, fmt);
+	vaf.fmt = fmt;
+	vaf.va = &args;
+
+	size = snprintf(*buf, *buf_size, "%pV", &vaf);
+	if (size >= *buf_size)
+		size = *buf_size;
+
+	*buf += size;
+	*buf_size -= size;
+
+	va_end(args);
+}
+
+static void print_device_in_use_info(struct hl_device *hdev, const char *message)
+{
+	u32 active_cs_num, dmabuf_export_cnt;
+	char buf[64], *buf_ptr = buf;
+	size_t buf_size = sizeof(buf);
+	bool unknown_reason = true;
+
+	active_cs_num = hl_get_active_cs_num(hdev);
+	if (active_cs_num) {
+		unknown_reason = false;
+		compose_device_in_use_info(&buf_ptr, &buf_size, " [%u active CS]", active_cs_num);
+	}
+
+	dmabuf_export_cnt = atomic_read(&hdev->dmabuf_export_cnt);
+	if (dmabuf_export_cnt) {
+		unknown_reason = false;
+		compose_device_in_use_info(&buf_ptr, &buf_size, " [%u exported dma-buf]",
+						dmabuf_export_cnt);
+	}
+
+	if (unknown_reason)
+		compose_device_in_use_info(&buf_ptr, &buf_size, " [unknown reason]");
+
+	dev_notice(hdev->dev, "%s%s\n", message, buf);
+}
+
 /*
  * hl_device_release - release function for habanalabs device
  *
@@ -519,12 +565,11 @@  static int hl_device_release(struct inode *inode, struct file *filp)
 	hdev->compute_ctx_in_release = 1;
 
 	if (!hl_hpriv_put(hpriv)) {
-		dev_notice(hdev->dev, "User process closed FD but device still in use\n");
+		print_device_in_use_info(hdev, "User process closed FD but device still in use");
 		hl_device_reset(hdev, HL_DRV_RESET_HARD);
 	}
 
-	hdev->last_open_session_duration_jif =
-		jiffies - hdev->last_successful_open_jif;
+	hdev->last_open_session_duration_jif = jiffies - hdev->last_successful_open_jif;
 
 	return 0;
 }
diff --git a/drivers/accel/habanalabs/common/habanalabs.h b/drivers/accel/habanalabs/common/habanalabs.h
index d98e6c0feb24..afdae5775eaa 100644
--- a/drivers/accel/habanalabs/common/habanalabs.h
+++ b/drivers/accel/habanalabs/common/habanalabs.h
@@ -3200,6 +3200,7 @@  struct hl_reset_info {
  *                drams are binned-out
  * @tpc_binning: contains mask of tpc engines that is received from the f/w which indicates which
  *               tpc engines are binned-out
+ * @dmabuf_export_cnt: number of dma-buf exporting.
  * @card_type: Various ASICs have several card types. This indicates the card
  *             type of the current device.
  * @major: habanalabs kernel driver major.
@@ -3371,7 +3372,7 @@  struct hl_device {
 	u64				fw_comms_poll_interval_usec;
 	u64				dram_binning;
 	u64				tpc_binning;
-
+	atomic_t			dmabuf_export_cnt;
 	enum cpucp_card_types		card_type;
 	u32				major;
 	u32				high_pll;
@@ -3664,6 +3665,7 @@  bool cs_needs_timeout(struct hl_cs *cs);
 bool is_staged_cs_last_exists(struct hl_device *hdev, struct hl_cs *cs);
 struct hl_cs *hl_staged_cs_find_first(struct hl_device *hdev, u64 cs_seq);
 void hl_multi_cs_completion_init(struct hl_device *hdev);
+u32 hl_get_active_cs_num(struct hl_device *hdev);
 
 void goya_set_asic_funcs(struct hl_device *hdev);
 void gaudi_set_asic_funcs(struct hl_device *hdev);
diff --git a/drivers/accel/habanalabs/common/memory.c b/drivers/accel/habanalabs/common/memory.c
index 7b3809853dd5..e115d264b03b 100644
--- a/drivers/accel/habanalabs/common/memory.c
+++ b/drivers/accel/habanalabs/common/memory.c
@@ -1833,6 +1833,7 @@  static void hl_release_dmabuf(struct dma_buf *dmabuf)
 	if (hl_dmabuf->memhash_hnode)
 		memhash_node_export_put(ctx, hl_dmabuf->memhash_hnode);
 
+	atomic_dec(&ctx->hdev->dmabuf_export_cnt);
 	hl_ctx_put(ctx);
 	kfree(hl_dmabuf);
 }
@@ -1872,6 +1873,7 @@  static int export_dmabuf(struct hl_ctx *ctx,
 
 	hl_dmabuf->ctx = ctx;
 	hl_ctx_get(hl_dmabuf->ctx);
+	atomic_inc(&ctx->hdev->dmabuf_export_cnt);
 
 	*dmabuf_fd = fd;