Message ID | 20230212204454.2938561-8-ogabbay@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [01/27] habanalabs/gaudi2: increase user interrupt grace time | expand |
On Sun, Feb 12, 2023 at 10:44:35PM +0200, Oded Gabbay wrote: > From: Tomer Tayar <ttayar@habana.ai> > > When user closes the device file descriptor, it is checked whether the > device is still in use, and a message is printed if it is. I guess this is only for debugging your user-space component? Because kernel driver should not make any assumption about user-space behavior. Closing whenever user wants should be no problem. > +static void print_device_in_use_info(struct hl_device *hdev, const char *message) > +{ > + u32 active_cs_num, dmabuf_export_cnt; > + char buf[64], *buf_ptr = buf; > + size_t buf_size = sizeof(buf); > + bool unknown_reason = true; > + > + active_cs_num = hl_get_active_cs_num(hdev); > + if (active_cs_num) { > + unknown_reason = false; > + compose_device_in_use_info(&buf_ptr, &buf_size, " [%u active CS]", active_cs_num); > + } > + > + dmabuf_export_cnt = atomic_read(&hdev->dmabuf_export_cnt); > + if (dmabuf_export_cnt) { > + unknown_reason = false; > + compose_device_in_use_info(&buf_ptr, &buf_size, " [%u exported dma-buf]", > + dmabuf_export_cnt); > + } > + > + if (unknown_reason) > + compose_device_in_use_info(&buf_ptr, &buf_size, " [unknown reason]"); > + > + dev_notice(hdev->dev, "%s%s\n", message, buf); why not print counters directly, i.e. "active cs count %u, dmabuf export count %u" ? > if (!hl_hpriv_put(hpriv)) { > - dev_notice(hdev->dev, "User process closed FD but device still in use\n"); > + print_device_in_use_info(hdev, "User process closed FD but device still in use"); > hl_device_reset(hdev, HL_DRV_RESET_HARD); You really need reset here ? Regards Stanislaw
On Thu, Feb 16, 2023 at 2:25 PM Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> wrote: > > On Sun, Feb 12, 2023 at 10:44:35PM +0200, Oded Gabbay wrote: > > From: Tomer Tayar <ttayar@habana.ai> > > > > When user closes the device file descriptor, it is checked whether the > > device is still in use, and a message is printed if it is. > > I guess this is only for debugging your user-space component? > Because kernel driver should not make any assumption about > user-space behavior. Closing whenever user wants should be > no problem. First of all, indeed the user can close the device whatever it wants. We don't limit him, but we do need to track the device state, because we can't allow a new user to acquire the device until it is idle (due to h/w limitations). Therefore, this print is not so much for debug, as it is for letting the user know the device wasn't idle after he closed it, and therefore, we are going to reset it to make it idle. So, it is a notification that is important imo. > > > +static void print_device_in_use_info(struct hl_device *hdev, const char *message) > > +{ > > + u32 active_cs_num, dmabuf_export_cnt; > > + char buf[64], *buf_ptr = buf; > > + size_t buf_size = sizeof(buf); > > + bool unknown_reason = true; > > + > > + active_cs_num = hl_get_active_cs_num(hdev); > > + if (active_cs_num) { > > + unknown_reason = false; > > + compose_device_in_use_info(&buf_ptr, &buf_size, " [%u active CS]", active_cs_num); > > + } > > + > > + dmabuf_export_cnt = atomic_read(&hdev->dmabuf_export_cnt); > > + if (dmabuf_export_cnt) { > > + unknown_reason = false; > > + compose_device_in_use_info(&buf_ptr, &buf_size, " [%u exported dma-buf]", > > + dmabuf_export_cnt); > > + } > > + > > + if (unknown_reason) > > + compose_device_in_use_info(&buf_ptr, &buf_size, " [unknown reason]"); > > + > > + dev_notice(hdev->dev, "%s%s\n", message, buf); > > why not print counters directly, i.e. "active cs count %u, dmabuf export count %u" ? Because we wanted to print the specific reason, or unknown reason, and not print all the possible counters in one line, because most of the time most of the counters will be 0. We plan to add more reasons so this helper simplifies the code. > > > if (!hl_hpriv_put(hpriv)) { > > - dev_notice(hdev->dev, "User process closed FD but device still in use\n"); > > + print_device_in_use_info(hdev, "User process closed FD but device still in use"); > > hl_device_reset(hdev, HL_DRV_RESET_HARD); > > You really need reset here ? Yes, our h/w requires that we reset the device after the user closed it. If the device is not idle after the user closed it, we hard reset it. If it is idle, we do a more graceful reset. Thanks, Oded > > Regards > Stanislaw
On Thu, Feb 16, 2023 at 04:21:48PM +0200, Oded Gabbay wrote: > On Thu, Feb 16, 2023 at 2:25 PM Stanislaw Gruszka > <stanislaw.gruszka@linux.intel.com> wrote: > > > > On Sun, Feb 12, 2023 at 10:44:35PM +0200, Oded Gabbay wrote: > > > From: Tomer Tayar <ttayar@habana.ai> > > > > > > When user closes the device file descriptor, it is checked whether the > > > device is still in use, and a message is printed if it is. > > > > I guess this is only for debugging your user-space component? > > Because kernel driver should not make any assumption about > > user-space behavior. Closing whenever user wants should be > > no problem. > First of all, indeed the user can close the device whatever it wants. > We don't limit him, but we do need to track the device state, because > we can't allow a new user to acquire the device until it is idle (due > to h/w limitations). > Therefore, this print is not so much for debug, as it is for letting > the user know the device wasn't idle after he closed it, and > therefore, we are going to reset it to make it idle. > So, it is a notification that is important imo. This sounds like something that should be handed in open() with -EAGAIN error with eventual message in dmesg. But you know best what info is needed by user-space :-) > > > +static void print_device_in_use_info(struct hl_device *hdev, const char *message) > > > +{ > > > + u32 active_cs_num, dmabuf_export_cnt; > > > + char buf[64], *buf_ptr = buf; > > > + size_t buf_size = sizeof(buf); > > > + bool unknown_reason = true; > > > + > > > + active_cs_num = hl_get_active_cs_num(hdev); > > > + if (active_cs_num) { > > > + unknown_reason = false; > > > + compose_device_in_use_info(&buf_ptr, &buf_size, " [%u active CS]", active_cs_num); > > > + } > > > + > > > + dmabuf_export_cnt = atomic_read(&hdev->dmabuf_export_cnt); > > > + if (dmabuf_export_cnt) { > > > + unknown_reason = false; > > > + compose_device_in_use_info(&buf_ptr, &buf_size, " [%u exported dma-buf]", > > > + dmabuf_export_cnt); > > > + } > > > + > > > + if (unknown_reason) > > > + compose_device_in_use_info(&buf_ptr, &buf_size, " [unknown reason]"); > > > + > > > + dev_notice(hdev->dev, "%s%s\n", message, buf); > > > > why not print counters directly, i.e. "active cs count %u, dmabuf export count %u" ? > Because we wanted to print the specific reason, or unknown reason, and > not print all the possible counters in one line, because most of the > time most of the counters will be 0. > We plan to add more reasons so this helper simplifies the code. Ok, just place replace compose_device_in_use_info() with snprintf(). I don't think you need custom implementation of snprintf(). > > > + print_device_in_use_info(hdev, "User process closed FD but device still in use"); > > > hl_device_reset(hdev, HL_DRV_RESET_HARD); > > > > You really need reset here ? > Yes, our h/w requires that we reset the device after the user closed > it. If the device is not idle after the user closed it, we hard reset > it. > If it is idle, we do a more graceful reset. Hmm, ok. Regards Stanislaw
On Thu, Feb 16, 2023 at 17:05 Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> wrote: > On Thu, Feb 16, 2023 at 04:21:48PM +0200, Oded Gabbay wrote: > > On Thu, Feb 16, 2023 at 2:25 PM Stanislaw Gruszka > > <stanislaw.gruszka@linux.intel.com> wrote: > > > > > > On Sun, Feb 12, 2023 at 10:44:35PM +0200, Oded Gabbay wrote: > > > > From: Tomer Tayar <ttayar@habana.ai> > > > > > > > > When user closes the device file descriptor, it is checked whether the > > > > device is still in use, and a message is printed if it is. > > > > > > I guess this is only for debugging your user-space component? > > > Because kernel driver should not make any assumption about > > > user-space behavior. Closing whenever user wants should be > > > no problem. > > First of all, indeed the user can close the device whatever it wants. > > We don't limit him, but we do need to track the device state, because > > we can't allow a new user to acquire the device until it is idle (due > > to h/w limitations). > > Therefore, this print is not so much for debug, as it is for letting > > the user know the device wasn't idle after he closed it, and > > therefore, we are going to reset it to make it idle. > > So, it is a notification that is important imo. > > This sounds like something that should be handed in open() with -EAGAIN > error with eventual message in dmesg. But you know best what info > is needed by user-space :-) Because of the reset in this case and the involved cleanup, this info won't be available in next open(). > > > > +static void print_device_in_use_info(struct hl_device *hdev, const char > *message) > > > > +{ > > > > + u32 active_cs_num, dmabuf_export_cnt; > > > > + char buf[64], *buf_ptr = buf; > > > > + size_t buf_size = sizeof(buf); > > > > + bool unknown_reason = true; > > > > + > > > > + active_cs_num = hl_get_active_cs_num(hdev); > > > > + if (active_cs_num) { > > > > + unknown_reason = false; > > > > + compose_device_in_use_info(&buf_ptr, &buf_size, " [%u active > CS]", active_cs_num); > > > > + } > > > > + > > > > + dmabuf_export_cnt = atomic_read(&hdev->dmabuf_export_cnt); > > > > + if (dmabuf_export_cnt) { > > > > + unknown_reason = false; > > > > + compose_device_in_use_info(&buf_ptr, &buf_size, " [%u > exported dma-buf]", > > > > + dmabuf_export_cnt); > > > > + } > > > > + > > > > + if (unknown_reason) > > > > + compose_device_in_use_info(&buf_ptr, &buf_size, " [unknown > reason]"); > > > > + > > > > + dev_notice(hdev->dev, "%s%s\n", message, buf); > > > > > > why not print counters directly, i.e. "active cs count %u, dmabuf export > count %u" ? > > Because we wanted to print the specific reason, or unknown reason, and > > not print all the possible counters in one line, because most of the > > time most of the counters will be 0. > > We plan to add more reasons so this helper simplifies the code. > > Ok, just place replace compose_device_in_use_info() with snprintf(). > I don't think you need custom implementation of snprintf(). compose_device_in_use_info() was added to handle in a single place the snprintf() return value and the buffer pointer moving. However, you are correct and it is too much here, as the local buffer size is set with a value that should be enough for max possible print. We will remove compose_device_in_use_info() and use snprintf() directly. Thanks! > > > > + print_device_in_use_info(hdev, "User process closed FD but > device still in use"); > > > > hl_device_reset(hdev, HL_DRV_RESET_HARD); > > > > > > You really need reset here ? > > Yes, our h/w requires that we reset the device after the user closed > > it. If the device is not idle after the user closed it, we hard reset > > it. > > If it is idle, we do a more graceful reset. > > Hmm, ok. > > Regards > Stanislaw
On Fri, Feb 17, 2023 at 11:34:39AM +0000, Tomer Tayar wrote: > > > Ok, just place replace compose_device_in_use_info() with snprintf(). > > I don't think you need custom implementation of snprintf(). > > compose_device_in_use_info() was added to handle in a single place the snprintf() return value and the buffer pointer moving. > However, you are correct and it is too much here, as the local buffer size is set with a value that should be enough for max possible print. > We will remove compose_device_in_use_info() and use snprintf() directly. Actually the safer version would be scnprintf() since for that function return value could not be bigger than passed len. Usage then could be as simple as: n += scnprintf(buf + n, len - n, ...); n += scnprintf(buf + n, len - n, ...); Regards Stanislaw
On Thu, Feb 20, 2023 at 17:55 Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> wrote: > On Fri, Feb 17, 2023 at 11:34:39AM +0000, Tomer Tayar wrote: > > > > > Ok, just place replace compose_device_in_use_info() with snprintf(). > > > I don't think you need custom implementation of snprintf(). > > > > compose_device_in_use_info() was added to handle in a single place the > snprintf() return value and the buffer pointer moving. > > However, you are correct and it is too much here, as the local buffer size is set > with a value that should be enough for max possible print. > > We will remove compose_device_in_use_info() and use snprintf() directly. > > Actually the safer version would be scnprintf() since for that function > return value could not be bigger than passed len. Usage then could be > as simple as: > > n += scnprintf(buf + n, len - n, ...); > n += scnprintf(buf + n, len - n, ...); > > Regards > Stanislaw Sure, we will use it, thanks!
diff --git a/drivers/accel/habanalabs/common/command_submission.c b/drivers/accel/habanalabs/common/command_submission.c index e313ff8af7cc..24f3d82ee4cb 100644 --- a/drivers/accel/habanalabs/common/command_submission.c +++ b/drivers/accel/habanalabs/common/command_submission.c @@ -1168,6 +1168,22 @@ static void cs_completion(struct work_struct *work) hl_complete_job(hdev, job); } +u32 hl_get_active_cs_num(struct hl_device *hdev) +{ + u32 active_cs_num = 0; + struct hl_cs *cs; + + spin_lock(&hdev->cs_mirror_lock); + + list_for_each_entry(cs, &hdev->cs_mirror_list, mirror_node) + if (!cs->completed) + active_cs_num++; + + spin_unlock(&hdev->cs_mirror_lock); + + return active_cs_num; +} + static int validate_queue_index(struct hl_device *hdev, struct hl_cs_chunk *chunk, enum hl_queue_type *queue_type, diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c index 194c282d7e55..b8c74185eabd 100644 --- a/drivers/accel/habanalabs/common/device.c +++ b/drivers/accel/habanalabs/common/device.c @@ -492,6 +492,52 @@ int hl_hpriv_put(struct hl_fpriv *hpriv) return kref_put(&hpriv->refcount, hpriv_release); } +static void compose_device_in_use_info(char **buf, size_t *buf_size, const char *fmt, ...) +{ + struct va_format vaf; + va_list args; + int size; + + va_start(args, fmt); + vaf.fmt = fmt; + vaf.va = &args; + + size = snprintf(*buf, *buf_size, "%pV", &vaf); + if (size >= *buf_size) + size = *buf_size; + + *buf += size; + *buf_size -= size; + + va_end(args); +} + +static void print_device_in_use_info(struct hl_device *hdev, const char *message) +{ + u32 active_cs_num, dmabuf_export_cnt; + char buf[64], *buf_ptr = buf; + size_t buf_size = sizeof(buf); + bool unknown_reason = true; + + active_cs_num = hl_get_active_cs_num(hdev); + if (active_cs_num) { + unknown_reason = false; + compose_device_in_use_info(&buf_ptr, &buf_size, " [%u active CS]", active_cs_num); + } + + dmabuf_export_cnt = atomic_read(&hdev->dmabuf_export_cnt); + if (dmabuf_export_cnt) { + unknown_reason = false; + compose_device_in_use_info(&buf_ptr, &buf_size, " [%u exported dma-buf]", + dmabuf_export_cnt); + } + + if (unknown_reason) + compose_device_in_use_info(&buf_ptr, &buf_size, " [unknown reason]"); + + dev_notice(hdev->dev, "%s%s\n", message, buf); +} + /* * hl_device_release - release function for habanalabs device * @@ -519,12 +565,11 @@ static int hl_device_release(struct inode *inode, struct file *filp) hdev->compute_ctx_in_release = 1; if (!hl_hpriv_put(hpriv)) { - dev_notice(hdev->dev, "User process closed FD but device still in use\n"); + print_device_in_use_info(hdev, "User process closed FD but device still in use"); hl_device_reset(hdev, HL_DRV_RESET_HARD); } - hdev->last_open_session_duration_jif = - jiffies - hdev->last_successful_open_jif; + hdev->last_open_session_duration_jif = jiffies - hdev->last_successful_open_jif; return 0; } diff --git a/drivers/accel/habanalabs/common/habanalabs.h b/drivers/accel/habanalabs/common/habanalabs.h index d98e6c0feb24..afdae5775eaa 100644 --- a/drivers/accel/habanalabs/common/habanalabs.h +++ b/drivers/accel/habanalabs/common/habanalabs.h @@ -3200,6 +3200,7 @@ struct hl_reset_info { * drams are binned-out * @tpc_binning: contains mask of tpc engines that is received from the f/w which indicates which * tpc engines are binned-out + * @dmabuf_export_cnt: number of dma-buf exporting. * @card_type: Various ASICs have several card types. This indicates the card * type of the current device. * @major: habanalabs kernel driver major. @@ -3371,7 +3372,7 @@ struct hl_device { u64 fw_comms_poll_interval_usec; u64 dram_binning; u64 tpc_binning; - + atomic_t dmabuf_export_cnt; enum cpucp_card_types card_type; u32 major; u32 high_pll; @@ -3664,6 +3665,7 @@ bool cs_needs_timeout(struct hl_cs *cs); bool is_staged_cs_last_exists(struct hl_device *hdev, struct hl_cs *cs); struct hl_cs *hl_staged_cs_find_first(struct hl_device *hdev, u64 cs_seq); void hl_multi_cs_completion_init(struct hl_device *hdev); +u32 hl_get_active_cs_num(struct hl_device *hdev); void goya_set_asic_funcs(struct hl_device *hdev); void gaudi_set_asic_funcs(struct hl_device *hdev); diff --git a/drivers/accel/habanalabs/common/memory.c b/drivers/accel/habanalabs/common/memory.c index 7b3809853dd5..e115d264b03b 100644 --- a/drivers/accel/habanalabs/common/memory.c +++ b/drivers/accel/habanalabs/common/memory.c @@ -1833,6 +1833,7 @@ static void hl_release_dmabuf(struct dma_buf *dmabuf) if (hl_dmabuf->memhash_hnode) memhash_node_export_put(ctx, hl_dmabuf->memhash_hnode); + atomic_dec(&ctx->hdev->dmabuf_export_cnt); hl_ctx_put(ctx); kfree(hl_dmabuf); } @@ -1872,6 +1873,7 @@ static int export_dmabuf(struct hl_ctx *ctx, hl_dmabuf->ctx = ctx; hl_ctx_get(hl_dmabuf->ctx); + atomic_inc(&ctx->hdev->dmabuf_export_cnt); *dmabuf_fd = fd;