Message ID | 1462668011.32105.7.camel@petros-ultrathin (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
On Sat, 7 May 2016, Petros Koutoupis wrote: > The current state of the code checks to see if the reference to > scsi_cmnd is not null, but it never checks to see if it is null and > always assumes it is valid before its use in below switch statement. > This patch addresses that. > There is no sign-off here. > --- linux/drivers/scsi/megaraid/megaraid_sas_fusion.c.orig 2016-05-07 09:12:56.748969851 -0500 > +++ linux/drivers/scsi/megaraid/megaraid_sas_fusion.c 2016-05-07 09:15:29.612967113 -0500 > @@ -2277,6 +2277,10 @@ complete_cmd_fusion(struct megasas_insta > > if (cmd_fusion->scmd) > cmd_fusion->scmd->SCp.ptr = NULL; > + else if ((!cmd_fusion->scmd) && > + ((scsi_io_req->Function == MPI2_FUNCTION_SCSI_IO_REQUEST) || > + (scsi_io_req->Function == MEGASAS_MPI2_FUNCTION_LD_IO_REQUEST))) > + goto next; That contains a tautology. > > scmd_local = cmd_fusion->scmd; > status = scsi_io_req->RaidContext.status; > @@ -2336,7 +2340,7 @@ complete_cmd_fusion(struct megasas_insta > megasas_complete_cmd(instance, cmd_mfi, DID_OK); > break; > } > - > +next: > fusion->last_reply_idx[MSIxIndex]++; > if (fusion->last_reply_idx[MSIxIndex] >= > fusion->reply_q_depth) > > Your patch is equivalent to this one: @@ -2294,6 +2294,8 @@ complete(&cmd_fusion->done); break; case MPI2_FUNCTION_SCSI_IO_REQUEST: /*Fast Path IO.*/ + if (unlikely(!scmd_local)) + break; /* Update load balancing info */ device_id = MEGASAS_DEV_INDEX(scmd_local); lbinfo = &fusion->load_balance_info[device_id]; @@ -2311,6 +2313,8 @@ } /* Fall thru and complete IO */ case MEGASAS_MPI2_FUNCTION_LD_IO_REQUEST: /* LD-IO Path */ + if (unlikely(!scmd_local)) + break; /* Map the FW Cmd Status */ map_cmd_status(cmd_fusion, status, extStatus); scsi_io_req->RaidContext.status = 0; I don't know anything about this driver or hardware so I'm not actually advocating any of these changes, just pointing out that your patch has issues. I would have expected the 'smatch' checker to detect either the potential NULL pointer derefs or alternatively the redundant test for a non-NULL pointer. Maybe it has detected this already (?) or maybe the NULL pointer deref can be shown to be impossible?
On Sun, 2016-05-08 at 13:32 +1000, Finn Thain wrote: > On Sat, 7 May 2016, Petros Koutoupis wrote: > > > The current state of the code checks to see if the reference to > > scsi_cmnd is not null, but it never checks to see if it is null and > > always assumes it is valid before its use in below switch statement. > > This patch addresses that. > > > > There is no sign-off here. > Oops. Will resend. > > --- linux/drivers/scsi/megaraid/megaraid_sas_fusion.c.orig 2016-05-07 09:12:56.748969851 -0500 > > +++ linux/drivers/scsi/megaraid/megaraid_sas_fusion.c 2016-05-07 09:15:29.612967113 -0500 > > @@ -2277,6 +2277,10 @@ complete_cmd_fusion(struct megasas_insta > > > > if (cmd_fusion->scmd) > > cmd_fusion->scmd->SCp.ptr = NULL; > > + else if ((!cmd_fusion->scmd) && > > + ((scsi_io_req->Function == MPI2_FUNCTION_SCSI_IO_REQUEST) || > > + (scsi_io_req->Function == MEGASAS_MPI2_FUNCTION_LD_IO_REQUEST))) > > + goto next; > > That contains a tautology. > How so? The very definition of tautology implies that I am repeating the same kind of functionality in two different ways. The thing is, this function routine does not always need to have a valid scsi_cmnd reference, which is why in the first it checks if it is valid. In the second, it only checks if it is null under the only two cases which make use of it. > > > > scmd_local = cmd_fusion->scmd; > > status = scsi_io_req->RaidContext.status; > > @@ -2336,7 +2340,7 @@ complete_cmd_fusion(struct megasas_insta > > megasas_complete_cmd(instance, cmd_mfi, DID_OK); > > break; > > } > > - > > +next: > > fusion->last_reply_idx[MSIxIndex]++; > > if (fusion->last_reply_idx[MSIxIndex] >= > > fusion->reply_q_depth) > > > > > > Your patch is equivalent to this one: > > @@ -2294,6 +2294,8 @@ > complete(&cmd_fusion->done); > break; > case MPI2_FUNCTION_SCSI_IO_REQUEST: /*Fast Path IO.*/ > + if (unlikely(!scmd_local)) > + break; > /* Update load balancing info */ > device_id = MEGASAS_DEV_INDEX(scmd_local); > lbinfo = &fusion->load_balance_info[device_id]; > @@ -2311,6 +2313,8 @@ > } > /* Fall thru and complete IO */ > case MEGASAS_MPI2_FUNCTION_LD_IO_REQUEST: /* LD-IO Path */ > + if (unlikely(!scmd_local)) > + break; > /* Map the FW Cmd Status */ > map_cmd_status(cmd_fusion, status, extStatus); > scsi_io_req->RaidContext.status = 0; > > Yeah, except your patch introduces duplication of the same code and even worse, if the command starts in the MPI2_FUNCTION_SCSI_IO_REQUEST case, it will check the status of the pointer twice which seems pretty inefficient for systems running such high loads (i.e. the "Fall thru"). > I don't know anything about this driver or hardware so I'm not actually > advocating any of these changes, just pointing out that your patch has > issues. > > I would have expected the 'smatch' checker to detect either the potential > NULL pointer derefs or alternatively the redundant test for a non-NULL > pointer. Maybe it has detected this already (?) or maybe the NULL pointer > deref can be shown to be impossible? > It is very possible. We have had customers in the field hit pretty hard by this. > -- -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, 8 May 2016, Petros Koutoupis wrote: > > > > That contains a tautology. > > > > How so? if (x) /* ... */ else if (!x && (whatever)) /* ... */
On Sun, 2016-05-08 at 22:22 +1000, Finn Thain wrote: > On Sun, 8 May 2016, Petros Koutoupis wrote: > > > > > > > That contains a tautology. > > > > > > > How so? > > if (x) > /* ... */ > else if (!x && (whatever)) > /* ... */ > > -- Thank you but I know the logic of what I wrote. A tautology will yield the same results no matter what the interpretation. That is not a tautology. The two conditionals in my case check different states and serve different purposes. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Petros, On Mon, May 9, 2016 at 2:34 AM, Petros Koutoupis <petros@petroskoutoupis.com> wrote: > On Sun, 2016-05-08 at 22:22 +1000, Finn Thain wrote: >> On Sun, 8 May 2016, Petros Koutoupis wrote: >> >> > > >> > > That contains a tautology. >> > > >> > >> > How so? >> >> if (x) >> /* ... */ >> else if (!x && (whatever)) >> /* ... */ >> >> -- > > Thank you but I know the logic of what I wrote. A tautology > will yield the same results no matter what the interpretation. > That is not a tautology. The two conditionals in my case check > different states and serve different purposes. You're missing the point. Execution will only reach the else branch if "!cmd_fusion->scmd", hence checking that is unnecessary. Removing that test (and all the unnecessary parentheses) will reduce the second test to: else if (scsi_io_req->Function == MPI2_FUNCTION_SCSI_IO_REQUEST || scsi_io_req->Function == MEGASAS_MPI2_FUNCTION_LD_IO_REQUEST) which is much cleaner. Thanks,
On Mon, 2016-05-09 at 11:35 +1000, Julian Calaby wrote: > Hi Petros, > > On Mon, May 9, 2016 at 2:34 AM, Petros Koutoupis > <petros@petroskoutoupis.com> wrote: > > On Sun, 2016-05-08 at 22:22 +1000, Finn Thain wrote: > >> On Sun, 8 May 2016, Petros Koutoupis wrote: > >> > >> > > > >> > > That contains a tautology. > >> > > > >> > > >> > How so? > >> > >> if (x) > >> /* ... */ > >> else if (!x && (whatever)) > >> /* ... */ > >> > >> -- > > > > Thank you but I know the logic of what I wrote. A tautology > > will yield the same results no matter what the interpretation. > > That is not a tautology. The two conditionals in my case check > > different states and serve different purposes. > > You're missing the point. > > Execution will only reach the else branch if "!cmd_fusion->scmd", > hence checking that is unnecessary. Removing that test (and all the > unnecessary parentheses) will reduce the second test to: > > else if (scsi_io_req->Function == MPI2_FUNCTION_SCSI_IO_REQUEST || > scsi_io_req->Function == MEGASAS_MPI2_FUNCTION_LD_IO_REQUEST) > > which is much cleaner. > > Thanks, > Julian, I agree. I will clean up and resend. Thank you. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Smatch doesn't quite catch it because we check "cmd_fusion->scmd" for NULL then assign "scmd_local = cmd_fusion->scmd;" and dereference scmd_local unconditionally... It does catch part of the bug if you have cross function analysis: drivers/scsi/megaraid/megaraid_sas_fusion.c:2318 complete_cmd_fusion() error: we previously assumed 'cmd_fusion->scmd' could be null (see line 2281) But that code was from 2010 so I never reported it to the original author or the list. regards, dan carpenter -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
I've updated Smatch to catch these and I'm testing it now. We'll see how it goes. If my quick and dirty method doesn't has too many false positives then it's will take some months before I get around to doing it properly... Thanks for notifying me on this. regards, dan carpenter -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: Dan Carpenter [mailto:dan.carpenter@oracle.com] > Sent: Monday, May 09, 2016 1:36 PM > To: Finn Thain > Cc: Petros Koutoupis; kashyap.desai@avagotech.com; > sumit.saxena@avagotech.com; uday.lingala@avagotech.com; > megaraidlinux.pdl@avagotech.com; linux-scsi@vger.kernel.org > Subject: Re: [PATCH] megaraid: add scsi_cmnd NULL check before use > > Smatch doesn't quite catch it because we check "cmd_fusion->scmd" for NULL > then assign "scmd_local = cmd_fusion->scmd;" and dereference scmd_local > unconditionally... > > It does catch part of the bug if you have cross function analysis: > > drivers/scsi/megaraid/megaraid_sas_fusion.c:2318 complete_cmd_fusion() > error: we previously assumed 'cmd_fusion->scmd' could be null (see line 2281) > > But that code was from 2010 so I never reported it to the original author or the > list. "cmd_fusion->scmd" should not be NULL if scsi_io_req->Function is set to MPI2_FUNCTION_SCSI_IO_REQUEST (OR) MEGASAS_MPI2_FUNCTION_LD_IO_REQUES (inside these two cases only, cmd_fusion->scmd will be dereferenced). If cmd_fusion->scmd is NULL for these "scsi_io_req->Function", that will a BUG and should not continue with other commands processing in that case. > > regards, > dan carpenter -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
I'm confused what you are saying. According to Petros, these bugs are causing failures in real life, I was only added to the CC list because Smatch should have warned about them. regards, dan carpenter -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2016-05-09 at 15:18 +0530, Sumit Saxena wrote: > > > > -----Original Message----- > > From: Dan Carpenter [mailto:dan.carpenter@oracle.com] > > Sent: Monday, May 09, 2016 1:36 PM > > To: Finn Thain > > Cc: Petros Koutoupis; kashyap.desai@avagotech.com; > > sumit.saxena@avagotech.com; uday.lingala@avagotech.com; > > megaraidlinux.pdl@avagotech.com; linux-scsi@vger.kernel.org > > Subject: Re: [PATCH] megaraid: add scsi_cmnd NULL check before use > > > > Smatch doesn't quite catch it because we check "cmd_fusion->scmd" for > NULL > > > > then assign "scmd_local = cmd_fusion->scmd;" and dereference scmd_local > > unconditionally... > > > > It does catch part of the bug if you have cross function analysis: > > > > drivers/scsi/megaraid/megaraid_sas_fusion.c:2318 complete_cmd_fusion() > > error: we previously assumed 'cmd_fusion->scmd' could be null (see > line 2281) > > > > > > But that code was from 2010 so I never reported it to the original > author or the > > > > list. > "cmd_fusion->scmd" should not be NULL if scsi_io_req->Function is set to > MPI2_FUNCTION_SCSI_IO_REQUEST (OR) MEGASAS_MPI2_FUNCTION_LD_IO_REQUES > (inside these two cases only, cmd_fusion->scmd will be dereferenced). If > cmd_fusion->scmd is NULL for these "scsi_io_req->Function", that will a > BUG and > should not continue with other commands processing in that case. > Sumit, To clarify, a detection of cmd_fusion->scmd being NULL with scsi_io_req->Function set to MPI2_FUNCTION_SCSI_IO_REQUEST or MEGASAS_MPI2_FUNCTION_LD_IO_REQUEST should instead trigger a BUG() and not attempt to iterate to the next command in the list. Thank you. -- Petros -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: Petros Koutoupis [mailto:petros@petroskoutoupis.com] > Sent: Tuesday, May 10, 2016 2:59 AM > To: Sumit Saxena; Dan Carpenter; Finn Thain > Cc: kashyap.desai@avagotech.com; sumit.saxena@avagotech.com; > uday.lingala@avagotech.com; megaraidlinux.pdl@avagotech.com; linux- > scsi@vger.kernel.org > Subject: Re: [PATCH] megaraid: add scsi_cmnd NULL check before use > > On Mon, 2016-05-09 at 15:18 +0530, Sumit Saxena wrote: > > > > > > -----Original Message----- > > > From: Dan Carpenter [mailto:dan.carpenter@oracle.com] > > > Sent: Monday, May 09, 2016 1:36 PM > > > To: Finn Thain > > > Cc: Petros Koutoupis; kashyap.desai@avagotech.com; > > > sumit.saxena@avagotech.com; uday.lingala@avagotech.com; > > > megaraidlinux.pdl@avagotech.com; linux-scsi@vger.kernel.org > > > Subject: Re: [PATCH] megaraid: add scsi_cmnd NULL check before use > > > > > > Smatch doesn't quite catch it because we check "cmd_fusion->scmd" > > > for > > NULL > > > > > > then assign "scmd_local = cmd_fusion->scmd;" and dereference > > > scmd_local unconditionally... > > > > > > It does catch part of the bug if you have cross function analysis: > > > > > > drivers/scsi/megaraid/megaraid_sas_fusion.c:2318 > > > complete_cmd_fusion() > > > error: we previously assumed 'cmd_fusion->scmd' could be null (see > > line 2281) > > > > > > > > > But that code was from 2010 so I never reported it to the original > > author or the > > > > > > list. > > "cmd_fusion->scmd" should not be NULL if scsi_io_req->Function is set > > to MPI2_FUNCTION_SCSI_IO_REQUEST (OR) > > MEGASAS_MPI2_FUNCTION_LD_IO_REQUES > > (inside these two cases only, cmd_fusion->scmd will be dereferenced). > > If cmd_fusion->scmd is NULL for these "scsi_io_req->Function", that > > will a BUG and should not continue with other commands processing in > > that case. > > > > Sumit, > > To clarify, a detection of cmd_fusion->scmd being NULL with scsi_io_req- > >Function set to MPI2_FUNCTION_SCSI_IO_REQUEST or > MEGASAS_MPI2_FUNCTION_LD_IO_REQUEST should instead trigger a > BUG() and not attempt to iterate to the next command in the list. Thank > you. Petros, WARN_ON() can be used in this case. Upstream may have concerns on using BUG_ON() and also BUG_ON() won't help in this case. In production environment we never encountered this. Thanks, Sumit > > -- > Petros -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2016-05-11 at 15:11 +0530, Sumit Saxena wrote: > > -----Original Message----- > > From: Petros Koutoupis [mailto:petros@petroskoutoupis.com] > > Sent: Tuesday, May 10, 2016 2:59 AM > > To: Sumit Saxena; Dan Carpenter; Finn Thain > > Cc: kashyap.desai@avagotech.com; sumit.saxena@avagotech.com; > > uday.lingala@avagotech.com; megaraidlinux.pdl@avagotech.com; linux- > > scsi@vger.kernel.org > > Subject: Re: [PATCH] megaraid: add scsi_cmnd NULL check before use > > > > On Mon, 2016-05-09 at 15:18 +0530, Sumit Saxena wrote: > > > > > > > > -----Original Message----- > > > > From: Dan Carpenter [mailto:dan.carpenter@oracle.com] > > > > Sent: Monday, May 09, 2016 1:36 PM > > > > To: Finn Thain > > > > Cc: Petros Koutoupis; kashyap.desai@avagotech.com; > > > > sumit.saxena@avagotech.com; uday.lingala@avagotech.com; > > > > megaraidlinux.pdl@avagotech.com; linux-scsi@vger.kernel.org > > > > Subject: Re: [PATCH] megaraid: add scsi_cmnd NULL check before use > > > > > > > > Smatch doesn't quite catch it because we check "cmd_fusion->scmd" > > > > for > > > NULL > > > > > > > > then assign "scmd_local = cmd_fusion->scmd;" and dereference > > > > scmd_local unconditionally... > > > > > > > > It does catch part of the bug if you have cross function analysis: > > > > > > > > drivers/scsi/megaraid/megaraid_sas_fusion.c:2318 > > > > complete_cmd_fusion() > > > > error: we previously assumed 'cmd_fusion->scmd' could be null (see > > > line 2281) > > > > > > > > > > > > But that code was from 2010 so I never reported it to the original > > > author or the > > > > > > > > list. > > > "cmd_fusion->scmd" should not be NULL if scsi_io_req->Function is set > > > to MPI2_FUNCTION_SCSI_IO_REQUEST (OR) > > > MEGASAS_MPI2_FUNCTION_LD_IO_REQUES > > > (inside these two cases only, cmd_fusion->scmd will be dereferenced). > > > If cmd_fusion->scmd is NULL for these "scsi_io_req->Function", that > > > will a BUG and should not continue with other commands processing in > > > that case. > > > > > > > Sumit, > > > > To clarify, a detection of cmd_fusion->scmd being NULL with scsi_io_req- > > >Function set to MPI2_FUNCTION_SCSI_IO_REQUEST or > > MEGASAS_MPI2_FUNCTION_LD_IO_REQUEST should instead trigger a > > BUG() and not attempt to iterate to the next command in the list. Thank > > you. > > Petros, > > WARN_ON() can be used in this case. Upstream may have concerns on using > BUG_ON() and also BUG_ON() won't help in this case. In production > environment we never encountered this. > > Thanks, > Sumit > > > > -- > > Petros Sumit, I will resubmit the patch with all the recommendations. Thank you. In case you are interested, I have a crash file showcasing the error. I can always provide this outside of this mailing thread. -- Petros -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, May 11, 2016 at 08:49:51PM -0500, Petros Koutoupis wrote: > Sumit, > > I will resubmit the patch with all the recommendations. Thank you. In case > you are interested, I have a crash file showcasing the error. I can always > provide this outside of this mailing thread. > Please send it to the thread. To be honest, I totally can't understand this thread. Sumit says it is impossible and you are saying that you have seen it happen in real life. Are you using out of tree code or something? What are you doing that is unexpected? I don't see the point of adding a WARN_ON(). NULL derefs normally generate a pretty clear stack trace already (unless they are caused by memory corruption). Why are we not just fixing the bugs instead of warning and then crashing. Also when I'm doing static analysis people always tell me that "that bug is impossible, trust me." and instead of trusting people I really wish they would just show me the relevant code that prevents it from happening. regards, dan carpenter -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: Dan Carpenter [mailto:dan.carpenter@oracle.com] > Sent: Thursday, May 12, 2016 12:05 PM > To: Petros Koutoupis > Cc: Sumit Saxena; Finn Thain; kashyap.desai@avagotech.com; > sumit.saxena@avagotech.com; uday.lingala@avagotech.com; > megaraidlinux.pdl@avagotech.com; linux-scsi@vger.kernel.org > Subject: Re: [PATCH] megaraid: add scsi_cmnd NULL check before use > > On Wed, May 11, 2016 at 08:49:51PM -0500, Petros Koutoupis wrote: > > Sumit, > > > > I will resubmit the patch with all the recommendations. Thank you. In > > case you are interested, I have a crash file showcasing the error. I > > can always provide this outside of this mailing thread. > > > > Please send it to the thread. > > To be honest, I totally can't understand this thread. Sumit says it is impossible > and you are saying that you have seen it happen in real life. > Are you using out of tree code or something? What are you doing that is > unexpected? > > I don't see the point of adding a WARN_ON(). NULL derefs normally generate a > pretty clear stack trace already (unless they are caused by memory corruption). > Why are we not just fixing the bugs instead of warning and then crashing. Agree, if there scsi_cmnd is coming as NULL, please attach logs. I will look into them. > > Also when I'm doing static analysis people always tell me that "that bug is > impossible, trust me." and instead of trusting people I really wish they would just > show me the relevant code that prevents it from happening. Inside megasas_build_io_fusion() function, driver sets "cmd->scmd" pointer(SCSI command pointer received from SCSI mid layer). Functions called inside megasas_build_io_fusion()(which actually builds frame to be sent to firmware) are setting Function type- MPI2_FUNCTION_SCSI_IO_REQUEST (or) MEGASAS_MPI2_FUNCTION_LD_IO_REQUEST. So in case Function type set to any one these two, there must be valid "cmd->scmd". Thanks, Sumit > > regards, > dan carpenter -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 12 May 2016, Sumit Saxena wrote: > > From: Dan Carpenter [mailto:dan.carpenter@oracle.com] > > > > Also when I'm doing static analysis people always tell me that "that > > bug is impossible, trust me." and instead of trusting people I really > > wish they would just show me the relevant code that prevents it from > > happening. > > Inside megasas_build_io_fusion() function, driver sets "cmd->scmd" > pointer(SCSI command pointer received from SCSI mid layer). Functions > called inside megasas_build_io_fusion()(which actually builds frame to > be sent to firmware) are setting Function type- > MPI2_FUNCTION_SCSI_IO_REQUEST (or) MEGASAS_MPI2_FUNCTION_LD_IO_REQUEST. > So in case Function type set to any one these two, there must be valid > "cmd->scmd". That doesn't show what prevents the bug. It merely shows that the bug does not always manifest. For example, you might check whether anything prevents megasas_build_io_fusion() from returning before assigning cmd->scmd, like so: 2112 if (sge_count > instance->max_num_sge) { 2113 dev_err(&instance->pdev->dev, "Error. sge_count (0x%x) exceeds " 2114 "max (0x%x) allowed\n", sge_count, 2115 instance->max_num_sge); 2116 return 1; 2117 } Another possibility: cmd->io_request->Function is valid yet cmd->scmd is NULL when seen from the interrupt handler if it intervenes between the two statements in megasas_return_cmd_fusion(): 180 inline void megasas_return_cmd_fusion(struct megasas_instance *instance, 181 struct megasas_cmd_fusion *cmd) 182 { 183 cmd->scmd = NULL; 184 memset(cmd->io_request, 0, sizeof(struct MPI2_RAID_SCSI_IO_REQUEST)); 185 } You might want to confirm that locking always prevents that. OTOH, without an actual backtrace, I too might be reluctant to pursue this kind of speculation.
> -----Original Message----- > From: Finn Thain [mailto:fthain@telegraphics.com.au] > Sent: Friday, May 13, 2016 1:14 PM > To: Sumit Saxena > Cc: Dan Carpenter; Petros Koutoupis; kashyap.desai@avagotech.com; > sumit.saxena@avagotech.com; uday.lingala@avagotech.com; > megaraidlinux.pdl@avagotech.com; linux-scsi@vger.kernel.org > Subject: RE: [PATCH] megaraid: add scsi_cmnd NULL check before use > > > On Thu, 12 May 2016, Sumit Saxena wrote: > > > > From: Dan Carpenter [mailto:dan.carpenter@oracle.com] > > > > > > Also when I'm doing static analysis people always tell me that "that > > > bug is impossible, trust me." and instead of trusting people I > > > really wish they would just show me the relevant code that prevents > > > it from happening. > > > > Inside megasas_build_io_fusion() function, driver sets "cmd->scmd" > > pointer(SCSI command pointer received from SCSI mid layer). Functions > > called inside megasas_build_io_fusion()(which actually builds frame to > > be sent to firmware) are setting Function type- > > MPI2_FUNCTION_SCSI_IO_REQUEST (or) > MEGASAS_MPI2_FUNCTION_LD_IO_REQUEST. > > So in case Function type set to any one these two, there must be valid > > "cmd->scmd". > > That doesn't show what prevents the bug. It merely shows that the bug does not > always manifest. > > For example, you might check whether anything prevents > megasas_build_io_fusion() from returning before assigning cmd->scmd, like > so: > > 2112 if (sge_count > instance->max_num_sge) { > 2113 dev_err(&instance->pdev->dev, "Error. sge_count (0x%x) > exceeds " > 2114 "max (0x%x) allowed\n", sge_count, > 2115 instance->max_num_sge); > 2116 return 1; > 2117 } > > Another possibility: cmd->io_request->Function is valid yet cmd->scmd is NULL > when seen from the interrupt handler if it intervenes between the two > statements in megasas_return_cmd_fusion(): For IOs returned by above error are not actually fired to firmware so there will be no interrupt handler called for the same. Anyways, if anyone has logs pertaining to the failure, please attach.. > > 180 inline void megasas_return_cmd_fusion(struct megasas_instance *instance, > 181 struct megasas_cmd_fusion *cmd) > 182 { > 183 cmd->scmd = NULL; > 184 memset(cmd->io_request, 0, sizeof(struct > MPI2_RAID_SCSI_IO_REQUEST)); > 185 } > > You might want to confirm that locking always prevents that. > > OTOH, without an actual backtrace, I too might be reluctant to pursue this kind > of speculation. > > -- -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 13 May 2016, Sumit Saxena wrote: > > For IOs returned by above error are not actually fired to firmware so > there will be no interrupt handler called for the same. You don't need both possibilities to hold. Either one would do it!
On Thu, 2016-05-12 at 09:35 +0300, Dan Carpenter wrote: > On Wed, May 11, 2016 at 08:49:51PM -0500, Petros Koutoupis wrote: > > Sumit, > > > > I will resubmit the patch with all the recommendations. Thank you. In case > > you are interested, I have a crash file showcasing the error. I can always > > provide this outside of this mailing thread. > > > > Please send it to the thread. > > To be honest, I totally can't understand this thread. Sumit says it is > impossible and you are saying that you have seen it happen in real life. > Are you using out of tree code or something? What are you doing that is > unexpected? > > I don't see the point of adding a WARN_ON(). NULL derefs normally > generate a pretty clear stack trace already (unless they are caused by > memory corruption). Why are we not just fixing the bugs instead of > warning and then crashing. > > Also when I'm doing static analysis people always tell me that "that bug > is impossible, trust me." and instead of trusting people I really wish > they would just show me the relevant code that prevents it from > happening. > > regards, > dan carpenter > All, You will find the tarball containing the crash and kernel images here: https://drive.google.com/open?id=0B771xrWtHcpWWFI3Yl93WFBfT28 Note that it is a 7GB file. -- Petros -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
--- linux/drivers/scsi/megaraid/megaraid_sas_fusion.c.orig 2016-05-07 09:12:56.748969851 -0500 +++ linux/drivers/scsi/megaraid/megaraid_sas_fusion.c 2016-05-07 09:15:29.612967113 -0500 @@ -2277,6 +2277,10 @@ complete_cmd_fusion(struct megasas_insta if (cmd_fusion->scmd) cmd_fusion->scmd->SCp.ptr = NULL; + else if ((!cmd_fusion->scmd) && + ((scsi_io_req->Function == MPI2_FUNCTION_SCSI_IO_REQUEST) || + (scsi_io_req->Function == MEGASAS_MPI2_FUNCTION_LD_IO_REQUEST))) + goto next; scmd_local = cmd_fusion->scmd; status = scsi_io_req->RaidContext.status; @@ -2336,7 +2340,7 @@ complete_cmd_fusion(struct megasas_insta megasas_complete_cmd(instance, cmd_mfi, DID_OK); break; } - +next: fusion->last_reply_idx[MSIxIndex]++; if (fusion->last_reply_idx[MSIxIndex] >= fusion->reply_q_depth)