Message ID | 1700864395-1479-4-git-send-email-quic_mojha@quicinc.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Add Qualcomm Minidump driver related support | expand |
On Sat, Nov 25, 2023 at 03:49:46AM +0530, Mukesh Ojha wrote: > diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst > index 43ea35613dfc..251d070486c2 100644 > --- a/Documentation/admin-guide/index.rst > +++ b/Documentation/admin-guide/index.rst > @@ -120,6 +120,7 @@ configure specific aspects of kernel behavior to your liking. > perf-security > pm/index > pnp > + qcom_minidump > rapidio > ras > rtc > diff --git a/Documentation/admin-guide/qcom_minidump.rst b/Documentation/admin-guide/qcom_minidump.rst > new file mode 100644 > index 000000000000..b492f2b79639 > --- /dev/null > +++ b/Documentation/admin-guide/qcom_minidump.rst > @@ -0,0 +1,272 @@ > +Qualcomm minidump feature > +========================= > + > +Introduction > +------------ > + > +Minidump is a best effort mechanism to collect useful and predefined > +data for first level of debugging on end user devices running on > +Qualcomm SoCs. It is built on the premise that System on Chip (SoC) > +or subsystem part of SoC crashes, due to a range of hardware and > +software bugs. Hence, the ability to collect accurate data is only > +a best-effort. The data collected could be invalid or corrupted, data > +collection itself could fail, and so on. > + > +Qualcomm devices in engineering mode provides a mechanism for generating > +full system RAM dumps for post-mortem debugging. But in some cases it's > +however not feasible to capture the entire content of RAM. The minidump > +mechanism provides the means for selected region should be included in > +the ramdump. > + > + > +:: > + > + +-----------------------------------------------+ > + | DDR +-------------+ | > + | | SS0-ToC| | > + | +----------------+ +----------------+ | | > + | |Shared memory | | SS1-ToC| | | > + | |(SMEM) | | | | | > + | | | +-->|--------+ | | | > + | |G-ToC | | | SS-ToC \ | | | > + | |+-------------+ | | | +-----------+ | | | > + | ||-------------| | | | |-----------| | | | > + | || SS0-ToC | | | +-|<|SS1 region1| | | | > + | ||-------------| | | | | |-----------| | | | > + | || SS1-ToC |-|>+ | | |SS1 region2| | | | > + | ||-------------| | | | |-----------| | | | > + | || SS2-ToC | | | | | ... | | | | > + | ||-------------| | | | |-----------| | | | > + | || ... | | |-|<|SS1 regionN| | | | > + | ||-------------| | | | |-----------| | | | > + | || SSn-ToC | | | | +-----------+ | | | > + | |+-------------+ | | | | | | > + | | | | |----------------| | | > + | | | +>| regionN | | | > + | | | | |----------------| | | > + | +----------------+ | | | | | > + | | |----------------| | | > + | +>| region1 | | | > + | |----------------| | | > + | | | | | > + | |----------------|-+ | > + | | region5 | | > + | |----------------| | > + | | | | > + | Region information +----------------+ | > + | +---------------+ | > + | |region name | | > + | |---------------| | > + | |region address | | > + | |---------------| | > + | |region size | | > + | +---------------+ | > + +-----------------------------------------------+ > + G-ToC: Global table of contents > + SS-ToC: Subsystem table of contents > + SS0-SSn: Subsystem numbered from 0 to n > + > +It depends on SoC where the underlying firmware is keeping the > +minidump global table taking care of subsystem ToC part for > +minidump like for above diagram, it is for shared memory sitting > +in DDR and it is shared among various master however it is possible > +that this could be implemented via memory mapped regions but the > +general idea should remain same. Here, various subsystem could be > +DSP's like ADSP/CDSP/MODEM etc, along with Application processor > +(APSS) where Linux runs. DSP minidump gets collected when DSP's goes > +for recovery followed by a crash. The minidump part of code for > +that resides in ``qcom_rproc_minidump.c``. > + > + > +SMEM as backend > +---------------- > + > +In this document, SMEM will be used as the backend implementation > +of minidump. > + > +The core of minidump feature is part of Qualcomm's boot firmware code. > +It initializes shared memory (SMEM), which is a part of DDR and > +allocates a small section of it to minidump table, i.e. also called > +global table of contents (G-ToC). Each subsystem (APSS, ADSP, ...) has > +its own table of segments to be included in the minidump, all > +references from a descriptor in SMEM (G-ToC). Each segment/region has > +some details like name, physical address and its size etc. and it > +could be anywhere scattered in the DDR. > + > +Qualcomm APSS Minidump kernel driver concept > +-------------------------------------------- > + > +Qualcomm APSS minidump kernel driver adds the capability to add Linux > +region to be dumped as part of RAM dump collection. At the moment, > +shared memory driver creates platform device for minidump driver and > +give a means to APSS minidump to initialize itself on probe. > + > +This driver provides ``qcom_minidump_region_register`` and > +``qcom_minidump_region_unregister`` API's to register and unregister > +APSS minidump region. It also supports registration for the clients > +who came before minidump driver was initialized. It maintains pending > +list of clients who came before minidump and once minidump is initialized > +it registers them in one go. > + > +To simplify post-mortem debugging, driver creates and maintain an ELF > +header as first region that gets updated each time a new region gets > +registered. > + > +The solution supports extracting the RAM dump/minidump produced either > +over USB or stored to an attached storage device. > + > +Dependency of minidump kernel driver > +------------------------------------ > + > +It is to note that whole of minidump depends on Qualcomm boot firmware > +whether it supports minidump or not. So, if the minidump SMEM ID is > +present in shared memory, it indicates that minidump is supported from > +boot firmware and it is possible to dump Linux (APSS) region as part > +of minidump collection. > + > +How a kernel client driver can register region with minidump > +------------------------------------------------------------ > + > +Client driver can use ``qcom_minidump_region_register`` API's to register > +and ``qcom_minidump_region_unregister`` to unregister their region from > +minidump driver. > + > +Client needs to fill their region by filling ``qcom_minidump_region`` > +structure object which consists of the region name, region's virtual > +and physical address and its size. > + > +Below, is one sample client driver snippet which tries to allocate a > +region from kernel heap of certain size and it writes a certain known > +pattern (that can help in verification after collection that we got > +the exact pattern, what we wrote) and registers it with minidump. > + > + .. code-block:: c > + > + #include <soc/qcom/qcom_minidump.h> > + [...] > + > + > + [... inside a function ...] > + struct qcom_minidump_region region; > + > + [...] > + > + client_mem_region = kzalloc(region_size, GFP_KERNEL); > + if (!client_mem_region) > + return -ENOMEM; > + > + [... Just write a pattern ...] > + memset(client_mem_region, 0xAB, region_size); > + > + [... Fill up the region object ...] > + strlcpy(region.name, "REGION_A", sizeof(region.name)); > + region.virt_addr = client_mem_region; > + region.phys_addr = virt_to_phys(client_mem_region); > + region.size = region_size; > + > + ret = qcom_minidump_region_register(®ion); > + if (ret < 0) { > + pr_err("failed to add region in minidump: err: %d\n", ret); > + return ret; > + } > + > + [...] > + > + > +Test > +---- > + > +Existing Qualcomm devices already supports entire RAM dump (also called > +full dump) by writing appropriate value to Qualcomm's top control and > +status register (tcsr) in ``driver/firmware/qcom_scm.c`` . > + > +SCM device Tree bindings required to support download mode > +For example (sm8450) :: > + > + / { > + > + [...] > + > + firmware { > + scm: scm { > + compatible = "qcom,scm-sm8450", "qcom,scm"; > + [... tcsr register ... ] > + qcom,dload-mode = <&tcsr 0x13000>; > + > + [...] > + }; > + }; > + > + [...] > + > + soc: soc@0 { > + > + [...] > + > + tcsr: syscon@1fc0000 { > + compatible = "qcom,sm8450-tcsr", "syscon"; > + reg = <0x0 0x1fc0000 0x0 0x30000>; > + }; > + > + [...] > + }; > + [...] > + > + }; > + > +User of minidump can pass ``qcom_scm.download_mode="mini"`` to kernel > +commandline to set the current download mode to minidump. > +Similarly, ``"full"`` is passed to set the download mode to full dump > +where entire RAM dump will be collected while setting it ``"full,mini"`` > +will collect minidump along with fulldump. > + > +Writing to sysfs node can also be used to set the mode to minidump:: > + > + echo "mini" > /sys/module/qcom_scm/parameter/download_mode > + > +Once the download mode is set, any kind of crash will make the device collect > +respective dump as per set download mode. > + > +Dump collection > +--------------- > +:: > + > + +-----------+ > + | | > + | | +------+ > + | | | | > + | | +--+---+ Product(Qualcomm SoC) > + +-----------+ | > + |+++++++++++|<------------+ > + |+++++++++++| usb cable > + +-----------+ > + x86_64 PC > + > +The solution supports a product running with Qualcomm SoC (where minidump) > +is supported from the firmware) connected to x86_64 host PC running PCAT > +tool. It supports downloading the minidump produced from product to the > +host PC over USB or to save the minidump to the product attached storage > +device(UFS/eMMC/SD Card) into minidump dedicated partition. > + > +By default, dumps are downloaded via USB to the attached x86_64 PC running > +PCAT (Qualcomm tool) software. Upon download, we will see a set of binary > +blobs starting with name ``md_*`` in PCAT configured directory in x86_64 > +machine, so for above example from the client it will be ``md_REGION_A.BIN``. > +This binary blob depends on region content to determine whether it needs > +external parser support to get the content of the region, so for simple > +plain ASCII text we don't need any parsing and the content can be seen > +just opening the binary file. > + > +To collect the dump to attached storage type, one needs to write appropriate > +value to IMEM register, in that case dumps are collected in rawdump > +partition on the product device itself. > + > +One needs to read the entire rawdump partition and pull out content to > +save it onto the attached x86_64 machine over USB. Later, this rawdump > +can be passed to another tool (``dexter.exe`` [Qualcomm tool]) which > +converts this into the similar binary blobs which we have got it when > +download type was set to USB, i.e. a set of registered regions as blobs > +and their name starts with ``md_*``. > + > +Replacing the ``dexter.exe`` with some open source tool can be added as future > +scope of this document. LGTM, thanks! Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
On 24/11/2023 22:19, Mukesh Ojha wrote: > Add the qualcomm minidump guide for the users which tries to cover > the dependency, API use and the way to test and collect minidump > on Qualcomm supported SoCs. > > Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com> > --- > Documentation/admin-guide/index.rst | 1 + > Documentation/admin-guide/qcom_minidump.rst | 272 ++++++++++++++++++++++++++++ > 2 files changed, 273 insertions(+) > create mode 100644 Documentation/admin-guide/qcom_minidump.rst General nit-pick on sequencing of your patches. Its good practice (tm) to put your documentation first in your series. > diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst > index 43ea35613dfc..251d070486c2 100644 > --- a/Documentation/admin-guide/index.rst > +++ b/Documentation/admin-guide/index.rst > @@ -120,6 +120,7 @@ configure specific aspects of kernel behavior to your liking. > perf-security > pm/index > pnp > + qcom_minidump > rapidio > ras > rtc > diff --git a/Documentation/admin-guide/qcom_minidump.rst b/Documentation/admin-guide/qcom_minidump.rst > new file mode 100644 > index 000000000000..b492f2b79639 > --- /dev/null > +++ b/Documentation/admin-guide/qcom_minidump.rst > @@ -0,0 +1,272 @@ > +Qualcomm minidump feature > +========================= > + > +Introduction > +------------ > + > +Minidump is a best effort mechanism to collect useful and predefined > +data for first level of debugging on end user devices running on > +Qualcomm SoCs. What does "first-level debugging" mean here ? You use the term "post-mortem" later, stick with one term for consistency throughout your document. Suggest: "Minidump is a best-effort mechanism to collect useful predefined data for post-mortem debugging on a Qualcomm System on Chips (SoCs)." Generally better to stick with "Qualcomm SoC" and "Minidump" once you establish the terms upfront and early in your text - instead of reverting to "device" and "it". It is built on the premise that System on Chip (SoC) > +or subsystem part of SoC crashes, due to a range of hardware and > +software bugs. Instead of saying "It" say "Minidump" Hence, the ability to collect accurate data is only > +a best-effort. The data collected could be invalid or corrupted, data > +collection itself could fail, and so on. "and so on" is a redundancy, drop. Suggest: Minidump is built on the premise that a hardware or software component on the SoC has encountered an unexpected fault. This means that the data collected by minidump cannot be assumed to be correct or even present. > + > +Qualcomm devices in engineering mode provides a mechanism for generating > +full system RAM dumps for post-mortem debugging. Stick with the established naming convention "Qualcomm SoCs in engineering mode provide a mechanism for generating full system RAM dumps for this post-mortem debugging" But in some cases it's > +however not feasible to capture the entire content of RAM. The minidump > +mechanism provides the means for selected region should be included in > +the ramdump. Dont' start a sentence with "But" You're also not being clear what you mean by "the entire content of RAM" since you obviously can't capture a full snap-shot of DRAM and store in DRAM. Better to say IMO "Minidump captures specific pre-defined regions of RAM and stores those regions in a reserved Minidump specific buffer." "The region of RAM used to store Minidump data shall be referred to as SMEM throughout this document." [1] > +:: > + > + +-----------------------------------------------+ > + | DDR +-------------+ | Instead of saying "DDR" just mark this as RAM or Memory. Its a terrible nit-pick from me but DDR = "Double Data Rate Synchronous Dynamic Random-Access Memory" but that's irrelevant to this specification. We could be living in a gigantic SRAM for argument sake. > + | | SS0-ToC| | > + | +----------------+ +----------------+ | | > + | |Shared memory | | SS1-ToC| | | > + | |(SMEM) | | | | | > + | | | +-->|--------+ | | | > + | |G-ToC | | | SS-ToC \ | | | > + | |+-------------+ | | | +-----------+ | | | > + | ||-------------| | | | |-----------| | | | > + | || SS0-ToC | | | +-|<|SS1 region1| | | | > + | ||-------------| | | | | |-----------| | | | > + | || SS1-ToC |-|>+ | | |SS1 region2| | | | > + | ||-------------| | | | |-----------| | | | > + | || SS2-ToC | | | | | ... | | | | > + | ||-------------| | | | |-----------| | | | > + | || ... | | |-|<|SS1 regionN| | | | > + | ||-------------| | | | |-----------| | | | > + | || SSn-ToC | | | | +-----------+ | | | > + | |+-------------+ | | | | | | > + | | | | |----------------| | | > + | | | +>| regionN | | | > + | | | | |----------------| | | > + | +----------------+ | | | | | > + | | |----------------| | | > + | +>| region1 | | | > + | |----------------| | | > + | | | | | > + | |----------------|-+ | > + | | region5 | | > + | |----------------| | > + | | | | > + | Region information +----------------+ | > + | +---------------+ | > + | |region name | | > + | |---------------| | > + | |region address | | > + | |---------------| | > + | |region size | | > + | +---------------+ | > + +-----------------------------------------------+ > + G-ToC: Global table of contents > + SS-ToC: Subsystem table of contents > + SS0-SSn: Subsystem numbered from 0 to n You need to be more consistent here SSX-ToC -> SSX-SSn Where X is an integer from 0 upwards and similarly n is a integer from 0 upwards. This name SSX-SSn is not especially descriptive, I'm not sure if this is a name you are choosing here ? If so then let me suggest a new name like "Subsystem Memory Segment" SSX-MSn -> G-ToC: Global table of contents SSX-ToC: Subsystem X table of contents. X is an integer in the range of 0 to ? Is there an upper limit ? Presumably this is an 8, 16, 32 or 64 bit integer Please define either the size of the integer or the valid range of values 0..128, 0..256 SSX-MSn: Subsystem numbered from 0 to n Same comment for the 'n' here. > + > +It depends on SoC where the underlying firmware is keeping the > +minidump global table taking care of subsystem ToC part for > +minidump like for above diagram, it is for shared memory sitting > +in DDR and it is shared among various master however it is possible > +that this could be implemented via memory mapped regions but the > +general idea should remain same. Here, various subsystem could be > +DSP's like ADSP/CDSP/MODEM etc, along with Application processor > +(APSS) where Linux runs. DSP minidump gets collected when DSP's goes > +for recovery followed by a crash. The minidump part of code for > +that resides in ``qcom_rproc_minidump.c``. This paragraph is difficult to parse. What you are describing here is a linked list, I think you should have a paragraph describing how the memory structure works -> "Minidump determines which areas of DRAM to capture via a Minidump defined linked-list structure. At the top level a Global Table of Contents (GTOC) enumerates a variable number of SubSystem Table Of Contents (SSTOC) structures. Each SSTOC contains a list of SubSystem Memory Segements which are named according to the containing SSTOC hence (SSX-MSn) where "X" denotes the SystemSystem index of the containing SSX-ToC and "n" denotes an individual Memory Segment within the SystemSystem. Hence SS0-MS0 belongs to SS0-ToC whereas SS1-MS0 belongs to SS1-ToC." Then I think you can describe how the crash dump colleciton works and which agents of the system - DSP ? is responsible for collecting the crashdump -> "The Application Processor SubSystem (APSS) runs the Linux kernel and is therefore not responsible for assembling Minidump data. One of the other system agents in the SoC will be responsible for capturing the Minidump data during system reset. Typically one of the SoC Digital Signal Processors (DSP) will be used for this purpose. During reset the DSP will walk the GTOC, SSX-ToCs and SSX-MSns populating the Minidump RAM area with the indicated memory" > + > + > +SMEM as backend > +---------------- > + > +In this document, SMEM will be used as the backend implementation > +of minidump. [1] As per the above link, you need to introduce the term SMEM earlier. It's fine to expand on its meaning later but, do please define it once upfront before you use it in your awesome ASCII art. > +The core of minidump feature is part of Qualcomm's boot firmware code. > +It initializes shared memory (SMEM), which is a part of DDR and > +allocates a small section of it to minidump table, i.e. also called > +global table of contents (G-ToC). Each subsystem (APSS, ADSP, ...) has > +its own table of segments to be included in the minidump, all > +references from a descriptor in SMEM (G-ToC). Each segment/region has > +some details like name, physical address and its size etc. and it > +could be anywhere scattered in the DDR. -> "The SoC's bootloader must reserve an area of RAM as SMEM prior to handing over control to the run-time operating system. The bootloader is responsible to place the GTOC at the starting address of SMEM." If you want to give more technical details of size, physical address - then explicitly define those in the section above which talks about the linked-list structure. Please try to avoid use of "etc" or "and so on" since it assumes the reader already knows how the system works and can fill in the blanks but, what you are doing here is educating a Minidump novice in how things work. > + > +Qualcomm APSS Minidump kernel driver concept > +-------------------------------------------- > + > +Qualcomm APSS minidump kernel driver adds the capability to add Linux So why "Minidump" and then "minidump" choose one. > +region to be dumped as part of RAM dump collection. OK so this really is the "meat" of the system. The bootloader/firmware populates the GTOC. The Q this document should probably answer is how the kernel driver knows how/where to place its data. Assumed to be parsing the DTB. At the moment, > +shared memory driver creates platform device for minidump driver and > +give a means to APSS minidump to initialize itself on probe. "At the moment" is another drop. Just make a clear statement "The minidump platform driver populates the APSS porition of the GTOC" more interesting to me is - are there defined numbers, identifiers for the APSS ? or do we just add new entries to the GTOC ? ie. is there a reserved index or "type" in the GTOC that identifies where the APSS needs to insert itself ? > +This driver provides ``qcom_minidump_region_register`` and > +``qcom_minidump_region_unregister`` API's to register and unregister > +APSS minidump region. Why does it do that ? Is it not the case that the driver knows where the APSS data goes ? It also supports registration for the clients > +who came before minidump driver was initialized. It maintains pending > +list of clients who came before minidump and once minidump is initialized > +it registers them in one go. Don't start sentences with "It" -> "The driver" or "Minidump" As I read this though, the Minidump driver in Linux isn't just registering / managing the APSS side of things but also "doing stuff" for other clients ? How does the Linux driver know what to register ? > + > +To simplify post-mortem debugging, driver creates and maintain an ELF the driver creates and maintains > +header as first region that gets updated each time a new region gets > +registered. as the first region So - who is registering these regions ? Linux kernel drivers ? aDSP / cDSP ? If I write a new driver for Venus or GPU can I define my own region(s) to be captured ? Presumably. Please give more detail on this. > + > +The solution supports extracting the RAM dump/minidump produced either > +over USB or stored to an attached storage device. What provides that functionality ? The bootloader ? How do you trigger / capture that dump from the bootloader ? No need to go into super-detail but give some idea. > + > +Dependency of minidump kernel driver > +------------------------------------ > + > +It is to note that whole of minidump depends on Qualcomm boot firmware > +whether it supports minidump or not. You can drop this - you've already stated the bootloader/firmware must setup the initial table so, you're not providing additional information with this statement. > So, if the minidump SMEM ID is Try not to start sentences with "So" SMEM ID ? This is your first time using this term - please relate it back to your ASCII diagram and the description you give with that text. > +present in shared memory, it indicates that minidump is supported from > +boot firmware and it is possible to dump Linux (APSS) region as part > +of minidump collection. If _which_ SMEM ID ? It seems to me as if we are missing some important information here - what are the list of SMEM IDs ? Are these IDs serial and incrementing across SoC versions or SoC specific ? You need to define that data. > +How a kernel client driver can register region with minidump > +------------------------------------------------------------ Answering yes to my earlier question. A driver I write can make use of the API you are providing here. Great. Please give some indication of that earlier, even if its a reference to this description you give here "See X.Y later in this document" > + > +Client driver can use ``qcom_minidump_region_register`` API's to register > +and ``qcom_minidump_region_unregister`` to unregister their region from > +minidump driver. > + > +Client needs to fill their region by filling ``qcom_minidump_region`` > +structure object which consists of the region name, region's virtual > +and physical address and its size. Nit pick. You need a definite article here "A client driver" etc. > + > +Below, is one sample client driver snippet which tries to allocate a > +region from kernel heap of certain size and it writes a certain known > +pattern. Good (that can help in verification after collection that we got > +the exact pattern, what we wrote) and registers it with minidump. Not necessary to define this. We are all smart here and by now the intent of the mechanism is defined.. > + > + .. code-block:: c > + > + #include <soc/qcom/qcom_minidump.h> > + [...] > + > + > + [... inside a function ...] > + struct qcom_minidump_region region; > + > + [...] > + > + client_mem_region = kzalloc(region_size, GFP_KERNEL); > + if (!client_mem_region) > + return -ENOMEM; > + > + [... Just write a pattern ...] > + memset(client_mem_region, 0xAB, region_size); > + > + [... Fill up the region object ...] > + strlcpy(region.name, "REGION_A", sizeof(region.name)); > + region.virt_addr = client_mem_region; > + region.phys_addr = virt_to_phys(client_mem_region); > + region.size = region_size; > + > + ret = qcom_minidump_region_register(®ion); > + if (ret < 0) { > + pr_err("failed to add region in minidump: err: %d\n", ret); > + return ret; > + } > + > + [...] > + > + > +Test > +---- Testing > + > +Existing Qualcomm devices already supports entire RAM dump (also called > +full dump) by writing appropriate value to Qualcomm's top control and > +status register (tcsr) in ``driver/firmware/qcom_scm.c`` . "Existing Qualcomm SoCs already support dumping the entire RAM to the SMEM area/segment/whatever" This is 100% counter-intuitive since SMEM lives in RAM, correct ? Full dump means what, a full dump of the APSS RAM ? What happens if SMEM cannot accommodate the full APSS RAM dump ? > + > +SCM device Tree bindings required to support download mode > +For example (sm8450) :: > + > + / { > + > + [...] > + > + firmware { > + scm: scm { > + compatible = "qcom,scm-sm8450", "qcom,scm"; > + [... tcsr register ... ] > + qcom,dload-mode = <&tcsr 0x13000>; > + > + [...] > + }; > + }; > + > + [...] > + > + soc: soc@0 { > + > + [...] > + > + tcsr: syscon@1fc0000 { > + compatible = "qcom,sm8450-tcsr", "syscon"; > + reg = <0x0 0x1fc0000 0x0 0x30000>; > + }; > + > + [...] > + }; > + [...] > + > + }; > + > +User of minidump can pass ``qcom_scm.download_mode="mini"`` to kernel > +commandline to set the current download mode to minidump. "A kernel command line parameter is provided ``qcom_scm.download_mode="mini"`` to facilitate ... but you aren't telling us what "minidump" captures "the current download" ? do you mean the current state ? Does the system continue to boot up if you pass qcom_scm.download_mode="mini ? will additional registrations to SMEM/Minidump work ? What happens to the minidump data if there is a _subsequent_ real crashdump ? Overwritten ? Also what happens if SMEM runs out of space ? Say I boot with ``qcom_scm.download_mode="mini"`` and then the system crashes - SMEM has a limit right ? So the minidump gets overwritten ? > +Similarly, ``"full"`` is passed to set the download mode to full dump > +where entire RAM dump will be collected while setting it ``"full,mini"`` > +will collect minidump along with fulldump. Still not super-clear what the difference between mini and full is here. > + > +Writing to sysfs node can also be used to set the mode to minidump:: > + > + echo "mini" > /sys/module/qcom_scm/parameter/download_mode > + > +Once the download mode is set, any kind of crash will make the device collect > +respective dump as per set download mode. Nice. > + > +Dump collection > +--------------- > +:: > + > + +-----------+ > + | | > + | | +------+ > + | | | | > + | | +--+---+ Product(Qualcomm SoC) > + +-----------+ | > + |+++++++++++|<------------+ > + |+++++++++++| usb cable > + +-----------+ > + x86_64 PC > + > +The solution supports a product running with Qualcomm SoC (where minidump) > +is supported from the firmware) connected to x86_64 host PC running PCAT > +tool. It supports downloading the minidump produced from product to the > +host PC over USB or to save the minidump to the product attached storage > +device(UFS/eMMC/SD Card) into minidump dedicated partition. It would be a good idea to reference this section earlier. > + > +By default, dumps are downloaded via USB to the attached x86_64 PC running > +PCAT (Qualcomm tool) software. Upon download, we will see a set of binary > +blobs starting with name ``md_*`` in PCAT configured directory in x86_64 > +machine, so for above example from the client it will be ``md_REGION_A.BIN``. > +This binary blob depends on region content to determine whether it needs > +external parser support to get the content of the region, so for simple > +plain ASCII text we don't need any parsing and the content can be seen > +just opening the binary file. > + > +To collect the dump to attached storage type, one needs to write appropriate > +value to IMEM register, in that case dumps are collected in rawdump > +partition on the product device itself. > + > +One needs to read the entire rawdump partition and pull out content to > +save it onto the attached x86_64 machine over USB. Later, this rawdump > +can be passed to another tool (``dexter.exe`` [Qualcomm tool]) which > +converts this into the similar binary blobs which we have got it when > +download type was set to USB, i.e. a set of registered regions as blobs > +and their name starts with ``md_*``. > + > +Replacing the ``dexter.exe`` with some open source tool can be added as future > +scope of this document. --- bod
Thanks you very much Bryan for giving so much time of yours in reviewing minidump DOC and I apologize for coming so late for your reply. I will definetely apply your suggestion in next series. Please find my reply inline for your queries. On 11/28/2023 6:45 PM, Bryan O'Donoghue wrote: > On 24/11/2023 22:19, Mukesh Ojha wrote: >> Add the qualcomm minidump guide for the users which tries to cover >> the dependency, API use and the way to test and collect minidump >> on Qualcomm supported SoCs. >> >> Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com> >> --- >> Documentation/admin-guide/index.rst | 1 + >> Documentation/admin-guide/qcom_minidump.rst | 272 >> ++++++++++++++++++++++++++++ >> 2 files changed, 273 insertions(+) >> create mode 100644 Documentation/admin-guide/qcom_minidump.rst > > General nit-pick on sequencing of your patches. Its good practice (tm) > to put your documentation first in your series. > >> diff --git a/Documentation/admin-guide/index.rst >> b/Documentation/admin-guide/index.rst >> index 43ea35613dfc..251d070486c2 100644 >> --- a/Documentation/admin-guide/index.rst >> +++ b/Documentation/admin-guide/index.rst >> @@ -120,6 +120,7 @@ configure specific aspects of kernel behavior to >> your liking. >> perf-security >> pm/index >> pnp >> + qcom_minidump >> rapidio >> ras >> rtc >> diff --git a/Documentation/admin-guide/qcom_minidump.rst >> b/Documentation/admin-guide/qcom_minidump.rst >> new file mode 100644 >> index 000000000000..b492f2b79639 >> --- /dev/null >> +++ b/Documentation/admin-guide/qcom_minidump.rst >> @@ -0,0 +1,272 @@ >> +Qualcomm minidump feature >> +========================= >> + >> +Introduction >> +------------ >> + >> +Minidump is a best effort mechanism to collect useful and predefined >> +data for first level of debugging on end user devices running on >> +Qualcomm SoCs. > > What does "first-level debugging" mean here ? You use the term > "post-mortem" later, stick with one term for consistency throughout your > document. First level debugging meant was to mean the amount of data captured is limited and may be sufficient to root cause an issue and you may need entire ramdump to debug the issue. > > Suggest: > > "Minidump is a best-effort mechanism to collect useful predefined data > for post-mortem debugging on a Qualcomm System on Chips (SoCs)." > > Generally better to stick with "Qualcomm SoC" and "Minidump" once you > establish the terms upfront and early in your text - instead of > reverting to "device" and "it". > > It is built on the premise that System on Chip (SoC) >> +or subsystem part of SoC crashes, due to a range of hardware and >> +software bugs. > Instead of saying "It" say "Minidump" Sure, will try to be consistent in term usage. > > Hence, the ability to collect accurate data is only >> +a best-effort. The data collected could be invalid or corrupted, data >> +collection itself could fail, and so on. > > "and so on" is a redundancy, drop. > > > Suggest: > Minidump is built on the premise that a hardware or software component > on the SoC has encountered an unexpected fault. This means that the data > collected by minidump cannot be assumed to be correct or even present. > >> + >> +Qualcomm devices in engineering mode provides a mechanism for generating >> +full system RAM dumps for post-mortem debugging. > > Stick with the established naming convention > > "Qualcomm SoCs in engineering mode provide a mechanism for generating > full system RAM dumps for this post-mortem debugging" > > But in some cases it's >> +however not feasible to capture the entire content of RAM. The minidump >> +mechanism provides the means for selected region should be included in >> +the ramdump. > > Dont' start a sentence with "But" Sure. > > You're also not being clear what you mean by "the entire content of RAM" > since you obviously can't capture a full snap-shot of DRAM and store in > DRAM. > > Better to say IMO > > "Minidump captures specific pre-defined regions of RAM and stores those > regions in a reserved Minidump specific buffer." > > "The region of RAM used to store Minidump data shall be referred to as > SMEM throughout this document." [1] ok > >> +:: >> + >> + +-----------------------------------------------+ >> + | DDR +-------------+ | > > Instead of saying "DDR" just mark this as RAM or Memory. > Its a terrible nit-pick from me but DDR = "Double Data Rate Synchronous > Dynamic Random-Access Memory" but that's irrelevant to this > specification. We could be living in a gigantic SRAM for argument sake. Sure. > >> + | | SS0-ToC| | >> + | +----------------+ +----------------+ | | >> + | |Shared memory | | SS1-ToC| | | >> + | |(SMEM) | | | | | >> + | | | +-->|--------+ | | | >> + | |G-ToC | | | SS-ToC \ | | | >> + | |+-------------+ | | | +-----------+ | | | >> + | ||-------------| | | | |-----------| | | | >> + | || SS0-ToC | | | +-|<|SS1 region1| | | | >> + | ||-------------| | | | | |-----------| | | | >> + | || SS1-ToC |-|>+ | | |SS1 region2| | | | >> + | ||-------------| | | | |-----------| | | | >> + | || SS2-ToC | | | | | ... | | | | >> + | ||-------------| | | | |-----------| | | | >> + | || ... | | |-|<|SS1 regionN| | | | >> + | ||-------------| | | | |-----------| | | | >> + | || SSn-ToC | | | | +-----------+ | | | >> + | |+-------------+ | | | | | | >> + | | | | |----------------| | | >> + | | | +>| regionN | | | >> + | | | | |----------------| | | >> + | +----------------+ | | | | | >> + | | |----------------| | | >> + | +>| region1 | | | >> + | |----------------| | | >> + | | | | | >> + | |----------------|-+ | >> + | | region5 | | >> + | |----------------| | >> + | | | | >> + | Region information +----------------+ | >> + | +---------------+ | >> + | |region name | | >> + | |---------------| | >> + | |region address | | >> + | |---------------| | >> + | |region size | | >> + | +---------------+ | >> + +-----------------------------------------------+ >> + G-ToC: Global table of contents >> + SS-ToC: Subsystem table of contents >> + SS0-SSn: Subsystem numbered from 0 to n > > You need to be more consistent here > > SSX-ToC -> SSX-SSn > > Where X is an integer from 0 upwards and similarly n is a integer from 0 > upwards. > > This name SSX-SSn is not especially descriptive, I'm not sure if this is > a name you are choosing here ? If so then let me suggest a new name like > "Subsystem Memory Segment" SSX-MSn looks good. > > -> > > G-ToC: Global table of contents > SSX-ToC: Subsystem X table of contents. > X is an integer in the range of 0 to ? > Is there an upper limit ? Older Soc had the limit of 10 but the newer SoC firmware this limit has increased. This reminds me of creating a patch to make the no of subsystem configurable through CONFIG_*. #define MAX_NUM_OF_SS 10 "drivers/remoteproc/qcom_common.c" > Presumably this is an 8, 16, 32 or 64 bit integer > Please define either the size of the integer or the > valid range of values 0..128, 0..256 > SSX-MSn: Subsystem numbered from 0 to n > Same comment for the 'n' here. Thanks again.. > >> + >> +It depends on SoC where the underlying firmware is keeping the >> +minidump global table taking care of subsystem ToC part for >> +minidump like for above diagram, it is for shared memory sitting >> +in DDR and it is shared among various master however it is possible > >> +that this could be implemented via memory mapped regions but the >> +general idea should remain same. Here, various subsystem could be >> +DSP's like ADSP/CDSP/MODEM etc, along with Application processor >> +(APSS) where Linux runs. > > > DSP minidump gets collected when DSP's goes >> +for recovery followed by a crash. The minidump part of code for >> +that resides in ``qcom_rproc_minidump.c``. > > This paragraph is difficult to parse. > > What you are describing here is a linked list, I think you should have a > paragraph describing how the memory structure works > > -> > > "Minidump determines which areas of DRAM to capture via a Minidump > defined linked-list structure. > > At the top level a Global Table of Contents (GTOC) enumerates a variable > number of SubSystem Table Of Contents (SSTOC) structures. > > Each SSTOC contains a list of SubSystem Memory Segements which are named > according to the containing SSTOC hence (SSX-MSn) where "X" denotes the > SystemSystem index of the containing SSX-ToC and "n" denotes an > individual Memory Segment within the SystemSystem. Hence SS0-MS0 belongs > to SS0-ToC whereas SS1-MS0 belongs to SS1-ToC." Thanks for this. > > Then I think you can describe how the crash dump colleciton works and > which agents of the system - DSP ? is responsible for collecting the > crashdump > > -> > > "The Application Processor SubSystem (APSS) runs the Linux kernel and is > therefore not responsible for assembling Minidump data. One of the other > system agents in the SoC will be responsible for capturing the Minidump > data during system reset. > > Typically one of the SoC Digital Signal Processors (DSP) will be used > for this purpose. > > During reset the DSP will walk the GTOC, SSX-ToCs and SSX-MSns > populating the Minidump RAM area with the indicated memory" > >> + >> + >> +SMEM as backend > > >> +---------------- >> + >> +In this document, SMEM will be used as the backend implementation >> +of minidump. > > [1] As per the above link, you need to introduce the term SMEM earlier. > > It's fine to expand on its meaning later but, do please define it once > upfront before you use it in your awesome ASCII art. Sure. > >> +The core of minidump feature is part of Qualcomm's boot firmware code. >> +It initializes shared memory (SMEM), which is a part of DDR and >> +allocates a small section of it to minidump table, i.e. also called >> +global table of contents (G-ToC). Each subsystem (APSS, ADSP, ...) has >> +its own table of segments to be included in the minidump, all >> +references from a descriptor in SMEM (G-ToC). Each segment/region has >> +some details like name, physical address and its size etc. and it >> +could be anywhere scattered in the DDR. > > -> > > "The SoC's bootloader must reserve an area of RAM as SMEM prior to > handing over control to the run-time operating system. The bootloader is > responsible to place the GTOC at the starting address of SMEM." > > If you want to give more technical details of size, physical address - > then explicitly define those in the section above which talks about the > linked-list structure. > > Please try to avoid use of "etc" or "and so on" since it assumes the > reader already knows how the system works and can fill in the blanks > but, what you are doing here is educating a Minidump novice in how > things work. ok. > >> + >> +Qualcomm APSS Minidump kernel driver concept >> +-------------------------------------------- >> + >> +Qualcomm APSS minidump kernel driver adds the capability to add Linux > > So why "Minidump" and then "minidump" choose one. > >> +region to be dumped as part of RAM dump collection. > > > OK so this really is the "meat" of the system. The bootloader/firmware > populates the GTOC. > > The Q this document should probably answer is how the kernel driver > knows how/where to place its data. > > Assumed to be parsing the DTB. > > At the moment, >> +shared memory driver creates platform device for minidump driver and >> +give a means to APSS minidump to initialize itself on probe. > > "At the moment" is another drop. > > Just make a clear statement > > "The minidump platform driver populates the APSS porition of the GTOC" > > more interesting to me is - are there defined numbers, identifiers for > the APSS ? or do we just add new entries to the GTOC ? > > ie. is there a reserved index or "type" in the GTOC that identifies > where the APSS needs to insert itself ? Correct. APSS has similar index i.e, 0 similar to minidump index for other remoteprocs mentioned belowin GTOC. https://elixir.bootlin.com/linux/v6.7-rc5/source/drivers/remoteproc/qcom_q6v5_pas.c#L1113 https://elixir.bootlin.com/linux/v6.7-rc5/source/drivers/remoteproc/qcom_q6v5_pas.c#L1133 https://elixir.bootlin.com/linux/v6.7-rc5/source/drivers/remoteproc/qcom_q6v5_pas.c#L1153 qcom_smem_get(QCOM_SMEM_HOST_ANY, SBL_MINIDUMP_SMEM_ID, &size); Above API is the key, And given SMEM ID it gives the object detail present in shared memory and whether it is Minidump global table structure object address in case of minidump or SoC info object in case of socinfo related information (drivers/soc/qcom/socinfo.c). > >> +This driver provides ``qcom_minidump_region_register`` and >> +``qcom_minidump_region_unregister`` API's to register and unregister >> +APSS minidump region. > > Why does it do that ? Is it not the case that the driver knows where the > APSS data goes ? If you mean why does it provide qcom_minidump_region_register/qcom_minidump_region_unregister E.g., it is for the linux clients to upfront register their buffer or object which they want to see to check once there is crash from their driver and same is true other linux clients. In the current state of the driver it does not export these symbols and only captures ramoops frontends data like console, dmesg, ftrace, pmesg but there is plan to do that when one of the upstream client driver interested in capturing something from their driver they have to use this API to register it with APSS minidump table. > > It also supports registration for the clients >> +who came before minidump driver was initialized. It maintains pending >> +list of clients who came before minidump and once minidump is >> initialized >> +it registers them in one go. > > Don't start sentences with "It" -> "The driver" or "Minidump" > > As I read this though, the Minidump driver in Linux isn't just > registering / managing the APSS side of things but also "doing stuff" > for other clients ? > > How does the Linux driver know what to register ? Again, I am reiterating what i said above, it is upto the client driver interests what they want to check from their driver once they see a crash from there driver, it could be some buffer content or structure object or anything. Minidump is kind of blind what are you capturing is not known to itself. > >> + >> +To simplify post-mortem debugging, driver creates and maintain an ELF > > the driver creates and maintains > >> +header as first region that gets updated each time a new region gets >> +registered. > > as the first region > > So - who is registering these regions ? Linux kernel drivers ? aDSP / > cDSP ? For APSS(linux), driver register it through the call made to qcom_minidump_region_{register|unregister} by the linux clients. Here, it provide flexiblity to dynamically register the regions. For ADSP/CDSP/MODEM, they will have their own way which is mostly statically decided what addresses range they want to capture on their crash. > > If I write a new driver for Venus or GPU can I define my own region(s) > to be captured ? I am not an expert of Venus/GPU but AFAIK, currently their firmware side don't use minidump but you can still use APSS minidump to register required their host regions from GPU/VENUS linux drivers once APSS minidump driver exports register/unregister API. > > Presumably. Please give more detail on this. > >> + >> +The solution supports extracting the RAM dump/minidump produced either >> +over USB or stored to an attached storage device. > > What provides that functionality ? The bootloader ? Yes, the bootloader. > > How do you trigger / capture that dump from the bootloader ? This is explained how to set minidump and once set any region of SOC crash will collect minidump and if not set by default entire ddr dump is collected. > > No need to go into super-detail but give some idea. > >> + >> +Dependency of minidump kernel driver >> +------------------------------------ >> + >> +It is to note that whole of minidump depends on Qualcomm boot firmware >> +whether it supports minidump or not. > > You can drop this - you've already stated the bootloader/firmware must > setup the initial table so, you're not providing additional information > with this statement. > >> So, if the minidump SMEM ID is > > Try not to start sentences with "So" > > SMEM ID ? This is your first time using this term - please relate it > back to your ASCII diagram and the description you give with that text. > >> +present in shared memory, it indicates that minidump is supported from >> +boot firmware and it is possible to dump Linux (APSS) region as part >> +of minidump collection. > > If _which_ SMEM ID ? > > It seems to me as if we are missing some important information here - > what are the list of SMEM IDs ? > > Are these IDs serial and incrementing across SoC versions or SoC specific ? I have described above in my explanation probably it should make sense now, will also try to add it in the doc. > > You need to define that data. > >> +How a kernel client driver can register region with minidump >> +------------------------------------------------------------ > > Answering yes to my earlier question. A driver I write can make use of > the API you are providing here. > > Great. Please give some indication of that earlier, even if its a > reference to this description you give here "See X.Y later in this > document" > >> + >> +Client driver can use ``qcom_minidump_region_register`` API's to >> register >> +and ``qcom_minidump_region_unregister`` to unregister their region from >> +minidump driver. >> + >> +Client needs to fill their region by filling ``qcom_minidump_region`` >> +structure object which consists of the region name, region's virtual >> +and physical address and its size. > > Nit pick. You need a definite article here "A client driver" etc. > >> + >> +Below, is one sample client driver snippet which tries to allocate a >> +region from kernel heap of certain size and it writes a certain known >> +pattern. > > Good > > (that can help in verification after collection that we got >> +the exact pattern, what we wrote) and registers it with minidump. > > Not necessary to define this. We are all smart here and by now the > intent of the mechanism is defined.. > >> + >> + .. code-block:: c >> + >> + #include <soc/qcom/qcom_minidump.h> >> + [...] >> + >> + >> + [... inside a function ...] >> + struct qcom_minidump_region region; >> + >> + [...] >> + >> + client_mem_region = kzalloc(region_size, GFP_KERNEL); >> + if (!client_mem_region) >> + return -ENOMEM; >> + >> + [... Just write a pattern ...] >> + memset(client_mem_region, 0xAB, region_size); >> + >> + [... Fill up the region object ...] >> + strlcpy(region.name, "REGION_A", sizeof(region.name)); >> + region.virt_addr = client_mem_region; >> + region.phys_addr = virt_to_phys(client_mem_region); >> + region.size = region_size; >> + >> + ret = qcom_minidump_region_register(®ion); >> + if (ret < 0) { >> + pr_err("failed to add region in minidump: err: %d\n", ret); >> + return ret; >> + } >> + >> + [...] >> + >> + >> +Test >> +---- > > Testing > >> + >> +Existing Qualcomm devices already supports entire RAM dump (also called >> +full dump) by writing appropriate value to Qualcomm's top control and >> +status register (tcsr) in ``driver/firmware/qcom_scm.c`` . > > "Existing Qualcomm SoCs already support dumping the entire RAM to the > SMEM area/segment/whatever" > > This is 100% counter-intuitive since SMEM lives in RAM, correct ? > > Full dump means what, a full dump of the APSS RAM ? What happens if SMEM > cannot accommodate the full APSS RAM dump ? Sorry, I never said full dump is captured in SMEM, the only difference of minidump with full dump is, no region registration is required so no need of SMEM(where the meta data is there right ?) and entire ddr dump is collected by firmware/bootloader and it sent to to host connected machine via USB through Sahara protocol. > >> + >> +SCM device Tree bindings required to support download mode >> +For example (sm8450) :: >> + >> + / { >> + >> + [...] >> + >> + firmware { >> + scm: scm { >> + compatible = "qcom,scm-sm8450", "qcom,scm"; >> + [... tcsr register ... ] >> + qcom,dload-mode = <&tcsr 0x13000>; >> + >> + [...] >> + }; >> + }; >> + >> + [...] >> + >> + soc: soc@0 { >> + >> + [...] >> + >> + tcsr: syscon@1fc0000 { >> + compatible = "qcom,sm8450-tcsr", "syscon"; >> + reg = <0x0 0x1fc0000 0x0 0x30000>; >> + }; >> + >> + [...] >> + }; >> + [...] >> + >> + }; >> + >> +User of minidump can pass ``qcom_scm.download_mode="mini"`` to kernel >> +commandline to set the current download mode to minidump. > > "A kernel command line parameter is provided > ``qcom_scm.download_mode="mini"`` to facilitate ... but you aren't > telling us what "minidump" captures "the current download" ? do you mean > the current state ? Minidump is a SoC solution and not specific to linux(APSS) so even if there is no APSS linux driver to capture linux specific regions, firmware like ADSP/CDSP/MODEM or Trustzone or boot loader still registers their own regions that get captured and you will get them on crash. If there is no registration from anywhere it will not dump anything. > > Does the system continue to boot up if you pass > qcom_scm.download_mode="mini ? will additional registrations to > SMEM/Minidump work ? yes, it will continue to boot, this parameter just convert the capture mode from full dump (entire ddr dump ) to minidump (smem) registered region dump. Yes, additional registration of linux region to SMEM will work. > > What happens to the minidump data if there is a _subsequent_ real > crashdump ? By passing qcom_scm.download_mode="mini" we are converting crash dump mode from full(complete ddr) to minidump. > > Overwritten ? > > Also what happens if SMEM runs out of space ? Say I boot with > ``qcom_scm.download_mode="mini"`` and then the system crashes - SMEM has > a limit right ? > > So the minidump gets overwritten ? I think, is this after your confusion full dump also gets written to SMEM ? if this crashes before qcom_scm.download_mode="mini" mode was set then full ramdump gets collected which is default gets set from the firmware however, it is also possible to set minidump by default from the firmware, but let's no go into that detail. > >> +Similarly, ``"full"`` is passed to set the download mode to full dump >> +where entire RAM dump will be collected while setting it ``"full,mini"`` >> +will collect minidump along with fulldump. > > Still not super-clear what the difference between mini and full is here. It should be clear by now. > >> + >> +Writing to sysfs node can also be used to set the mode to minidump:: >> + >> + echo "mini" > /sys/module/qcom_scm/parameter/download_mode >> + >> +Once the download mode is set, any kind of crash will make the device >> collect >> +respective dump as per set download mode. > > Nice. > >> + >> +Dump collection >> +--------------- >> +:: >> + >> + +-----------+ >> + | | >> + | | +------+ >> + | | | | >> + | | +--+---+ Product(Qualcomm SoC) >> + +-----------+ | >> + |+++++++++++|<------------+ >> + |+++++++++++| usb cable >> + +-----------+ >> + x86_64 PC >> + >> +The solution supports a product running with Qualcomm SoC (where >> minidump) >> +is supported from the firmware) connected to x86_64 host PC running PCAT >> +tool. > > It supports downloading the minidump produced from product to the >> +host PC over USB or to save the minidump to the product attached storage >> +device(UFS/eMMC/SD Card) into minidump dedicated partition. > > It would be a good idea to reference this section earlier. Ok. Thanks again, for going into such minute detail and asking these questions. I really appriciate the interest. -Mukesh > >> + >> +By default, dumps are downloaded via USB to the attached x86_64 PC >> running >> +PCAT (Qualcomm tool) software. Upon download, we will see a set of >> binary >> +blobs starting with name ``md_*`` in PCAT configured directory in x86_64 >> +machine, so for above example from the client it will be >> ``md_REGION_A.BIN``. >> +This binary blob depends on region content to determine whether it needs >> +external parser support to get the content of the region, so for simple >> +plain ASCII text we don't need any parsing and the content can be seen >> +just opening the binary file. >> + >> +To collect the dump to attached storage type, one needs to write >> appropriate >> +value to IMEM register, in that case dumps are collected in rawdump >> +partition on the product device itself. >> + >> +One needs to read the entire rawdump partition and pull out content to >> +save it onto the attached x86_64 machine over USB. Later, this rawdump >> +can be passed to another tool (``dexter.exe`` [Qualcomm tool]) which >> +converts this into the similar binary blobs which we have got it when >> +download type was set to USB, i.e. a set of registered regions as blobs >> +and their name starts with ``md_*``. >> + >> +Replacing the ``dexter.exe`` with some open source tool can be added >> as future >> +scope of this document. > > --- > bod
<+How a kernel client driver can register region with minidump <+------------------------------------------------------------ <+ <+Client driver can use ``qcom_minidump_region_register`` API's to register <+and ``qcom_minidump_region_unregister`` to unregister their region from <+minidump driver. <+ <+Client needs to fill their region by filling ``qcom_minidump_region`` <+structure object which consists of the region name, region's virtual <+and physical address and its size. Hi, Mukesh, wish you a good holiday :) I have the following idea, please help me to assess whether this can be implemented or not. As we all know, most of the kernel objects are allocated by the slab sub-system.I wonder if we can dump all memory keeped by the slab sub-system? If so, we got most of the kernel objects which will be helpful to fix problems when we run with system issues. How can we do this? From the description above, I think we should register one region for each slab, for each slab will have some pages, and the memory between each slab is non-continuous. As we all know, there are millions of slabs in the system, so if we dump slabs in this way, it will introduce a heavy overhead. I am not very familiar with qualcomm minidump, maybe my thought is wrong. Looking forward to your reply! Best Regards Ruipeng
On 12/25/2023 7:25 PM, Ruipeng Qi wrote: > <+How a kernel client driver can register region with minidump > <+------------------------------------------------------------ > <+ > <+Client driver can use ``qcom_minidump_region_register`` API's to register > <+and ``qcom_minidump_region_unregister`` to unregister their region from > <+minidump driver. > <+ > <+Client needs to fill their region by filling ``qcom_minidump_region`` > <+structure object which consists of the region name, region's virtual > <+and physical address and its size. > > Hi, Mukesh, wish you a good holiday :) Hope you had the same..:-) > > I have the following idea, please help me to assess whether this can be > implemented or not. As we all know, most of the kernel objects are > allocated by the slab sub-system.I wonder if we can dump all memory > keeped by the slab sub-system? If so, we got most of the kernel objects > which will be helpful to fix problems when we run with system issues. > > How can we do this? From the description above, I think we should > register one region for each slab, for each slab will have some pages, > and the memory between each slab is non-continuous. As we all > know, there are millions of slabs in the system, so if we dump slabs > in this way, it will introduce a heavy overhead. > > I am not very familiar with qualcomm minidump, maybe my thought > is wrong. Looking forward to your reply! In the current state and in simple terms, Qualcomm Minidump can not do this, Minidump is more of a consumer driver so, what ever gets registered with it, it can dump. Qualcomm Minidump serves bigger purpose to dump content in any kind of crash whether it is kernel or non-kernel like NOC errors/XPUs etc and both kernel/non-kernel entity can register to it, so we gets dump in any kind of system crash. One more thing, kernel part of minidump, we are calling it APSS Minidump has limitation of no of entries so it will be difficult to dump non-continuous regions after a certain number of registration ~200. However, we do have a solution in downstream kernel for it like to create a big CMA buffer and register this buffer with Minidump so that whatever gets dumped in that buffer gets captured during crash and fill up this buffer and create elf during panic. I think, similar thing you are also doing with your OS-minidump. I have just glanced into your implementation of OS-minidump, it more of relying on basic concept of RAM content preserved across boot and later reading it through procfs but this basic stuff is common to pstore(ram) as well and pstore has file system support why don't you make your driver as one of pstore record and that way Qualcomm minidump also gets benefited where entire OS-minidump record gets registered with Qualcomm minidump and we get this on panic and you get this via pstorefs. -Mukesh > > Best Regards > Ruipeng
On Wed, Jan 3, 2024 at 11:27 PM Mukesh Ojha <quic_mojha@quicinc.com> wrote: > > > One more thing, kernel part of minidump, we are calling it APSS Minidump > has limitation of no of entries so it will be difficult to dump > non-continuous regions after a certain number of registration ~200. However, > we do have a solution in downstream kernel for it like to create a big > CMA buffer and register this buffer with Minidump so that whatever gets > dumped in that buffer gets captured during crash and fill up this buffer > and create elf during panic. I think, similar thing you are also doing > with your OS-minidump. > > I have just glanced into your implementation of OS-minidump, it > more of relying on basic concept of RAM content preserved > across boot and later reading it through procfs but this basic > stuff is common to pstore(ram) as well and pstore has file system > support why don't you make your driver as one of pstore record and that > way Qualcomm minidump also gets benefited where entire OS-minidump > record gets registered with Qualcomm minidump and we get this on panic > and you get this via pstorefs. > Thanks Mukesh!It is a good suggestion to move OS-minidump forward! By the way, I have some questions here for which I need your assistance. Firstly,I can reimplement OS-minidump as one of the pstore records to dump data. The resulting dump file would contain thousands of non-contiguous memory regions, each with only the virtual address and size recorded. As far as I know, Qualcomm's minidump can handle several memory regions, each with a physical address and size. This seems to be a difference, and I'm curious as to how you deal with data dumped by OS-minidump. I would really appreciate it if you could provide more details on your approach. Secondly, what tools do you use to analyze the dump data, and does it support crash tool? Lastly, is Qualcomm minidump compatible with non-Qualcomm SoCs, and if so, how can one use it? Best Regards Ruipeng Qi
On 1/8/2024 9:04 PM, Ruipeng Qi wrote: > On Wed, Jan 3, 2024 at 11:27 PM Mukesh Ojha <quic_mojha@quicinc.com> wrote: >> >> >> One more thing, kernel part of minidump, we are calling it APSS Minidump >> has limitation of no of entries so it will be difficult to dump >> non-continuous regions after a certain number of registration ~200. However, >> we do have a solution in downstream kernel for it like to create a big >> CMA buffer and register this buffer with Minidump so that whatever gets >> dumped in that buffer gets captured during crash and fill up this buffer >> and create elf during panic. I think, similar thing you are also doing >> with your OS-minidump. >> >> I have just glanced into your implementation of OS-minidump, it >> more of relying on basic concept of RAM content preserved >> across boot and later reading it through procfs but this basic >> stuff is common to pstore(ram) as well and pstore has file system >> support why don't you make your driver as one of pstore record and that >> way Qualcomm minidump also gets benefited where entire OS-minidump >> record gets registered with Qualcomm minidump and we get this on panic >> and you get this via pstorefs. >> > Thanks Mukesh!It is a good suggestion to move OS-minidump forward! > By the way, I have some questions here for which I need your assistance. > > Firstly,I can reimplement OS-minidump as one of the pstore records to > dump data. The resulting dump file would contain thousands of > non-contiguous memory regions, each with only the virtual address and > size recorded. As far as I know, Qualcomm's minidump can handle > several memory regions, each with a physical address and size. > This seems to be a difference, and I'm curious as to how you deal with > data dumped by OS-minidump. I would really appreciate it if you could > provide more details on your approach. What my thought was to think your OS-minidump to be one of pstore record similar to existing records like console, ftrace, pmsg, dmesg etc., If you follow this series patch 11/12 and 12/12 is trying to get the pstore(ram) record information and registering with minidump and here the physical address are of the ramoops record addresses. So, once you are capturing everything inside in a record, all minidump has to do is get your Os-minidump record physical address and size and register with minidump. > > Secondly, what tools do you use to analyze the dump data, and does it > support crash tool? Currently, we are trying to capture only pstore ramoops region in text format and not been using any tool. Since, Qualcomm minidump is controlled from boot firmware and it can not be used on non-Qualcomm SoCs so here minidump driver and its usecase is limited to capture only pstore (ram)records for targets where RAM content is not guaranteed to be preserved across boots. So, you can think minidump as one of ramoops backend which will be dumping all the ramoops regions/records/zones. +---------+ +---------+ +--------+ +---------+ | console | | pmsg | | ftrace | | dmesg | ...Os-minidump +---------+ +---------+ +--------+ +---------+ | | | | | | | | +------------------------------------------+ | \ / +----------------+ (1) |pstore frontends| +----------------+ | \ / +------------------- + (2) | pstore backend(ram)| +--------------------+ | \ / +---------------+ (3) | qcom_minidump | +---------------+ > > Lastly, is Qualcomm minidump compatible with non-Qualcomm SoCs, > and if so, how can one use it? I already replied it above. -Mukesh > > Best Regards > Ruipeng Qi
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 43ea35613dfc..251d070486c2 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -120,6 +120,7 @@ configure specific aspects of kernel behavior to your liking. perf-security pm/index pnp + qcom_minidump rapidio ras rtc diff --git a/Documentation/admin-guide/qcom_minidump.rst b/Documentation/admin-guide/qcom_minidump.rst new file mode 100644 index 000000000000..b492f2b79639 --- /dev/null +++ b/Documentation/admin-guide/qcom_minidump.rst @@ -0,0 +1,272 @@ +Qualcomm minidump feature +========================= + +Introduction +------------ + +Minidump is a best effort mechanism to collect useful and predefined +data for first level of debugging on end user devices running on +Qualcomm SoCs. It is built on the premise that System on Chip (SoC) +or subsystem part of SoC crashes, due to a range of hardware and +software bugs. Hence, the ability to collect accurate data is only +a best-effort. The data collected could be invalid or corrupted, data +collection itself could fail, and so on. + +Qualcomm devices in engineering mode provides a mechanism for generating +full system RAM dumps for post-mortem debugging. But in some cases it's +however not feasible to capture the entire content of RAM. The minidump +mechanism provides the means for selected region should be included in +the ramdump. + + +:: + + +-----------------------------------------------+ + | DDR +-------------+ | + | | SS0-ToC| | + | +----------------+ +----------------+ | | + | |Shared memory | | SS1-ToC| | | + | |(SMEM) | | | | | + | | | +-->|--------+ | | | + | |G-ToC | | | SS-ToC \ | | | + | |+-------------+ | | | +-----------+ | | | + | ||-------------| | | | |-----------| | | | + | || SS0-ToC | | | +-|<|SS1 region1| | | | + | ||-------------| | | | | |-----------| | | | + | || SS1-ToC |-|>+ | | |SS1 region2| | | | + | ||-------------| | | | |-----------| | | | + | || SS2-ToC | | | | | ... | | | | + | ||-------------| | | | |-----------| | | | + | || ... | | |-|<|SS1 regionN| | | | + | ||-------------| | | | |-----------| | | | + | || SSn-ToC | | | | +-----------+ | | | + | |+-------------+ | | | | | | + | | | | |----------------| | | + | | | +>| regionN | | | + | | | | |----------------| | | + | +----------------+ | | | | | + | | |----------------| | | + | +>| region1 | | | + | |----------------| | | + | | | | | + | |----------------|-+ | + | | region5 | | + | |----------------| | + | | | | + | Region information +----------------+ | + | +---------------+ | + | |region name | | + | |---------------| | + | |region address | | + | |---------------| | + | |region size | | + | +---------------+ | + +-----------------------------------------------+ + G-ToC: Global table of contents + SS-ToC: Subsystem table of contents + SS0-SSn: Subsystem numbered from 0 to n + +It depends on SoC where the underlying firmware is keeping the +minidump global table taking care of subsystem ToC part for +minidump like for above diagram, it is for shared memory sitting +in DDR and it is shared among various master however it is possible +that this could be implemented via memory mapped regions but the +general idea should remain same. Here, various subsystem could be +DSP's like ADSP/CDSP/MODEM etc, along with Application processor +(APSS) where Linux runs. DSP minidump gets collected when DSP's goes +for recovery followed by a crash. The minidump part of code for +that resides in ``qcom_rproc_minidump.c``. + + +SMEM as backend +---------------- + +In this document, SMEM will be used as the backend implementation +of minidump. + +The core of minidump feature is part of Qualcomm's boot firmware code. +It initializes shared memory (SMEM), which is a part of DDR and +allocates a small section of it to minidump table, i.e. also called +global table of contents (G-ToC). Each subsystem (APSS, ADSP, ...) has +its own table of segments to be included in the minidump, all +references from a descriptor in SMEM (G-ToC). Each segment/region has +some details like name, physical address and its size etc. and it +could be anywhere scattered in the DDR. + +Qualcomm APSS Minidump kernel driver concept +-------------------------------------------- + +Qualcomm APSS minidump kernel driver adds the capability to add Linux +region to be dumped as part of RAM dump collection. At the moment, +shared memory driver creates platform device for minidump driver and +give a means to APSS minidump to initialize itself on probe. + +This driver provides ``qcom_minidump_region_register`` and +``qcom_minidump_region_unregister`` API's to register and unregister +APSS minidump region. It also supports registration for the clients +who came before minidump driver was initialized. It maintains pending +list of clients who came before minidump and once minidump is initialized +it registers them in one go. + +To simplify post-mortem debugging, driver creates and maintain an ELF +header as first region that gets updated each time a new region gets +registered. + +The solution supports extracting the RAM dump/minidump produced either +over USB or stored to an attached storage device. + +Dependency of minidump kernel driver +------------------------------------ + +It is to note that whole of minidump depends on Qualcomm boot firmware +whether it supports minidump or not. So, if the minidump SMEM ID is +present in shared memory, it indicates that minidump is supported from +boot firmware and it is possible to dump Linux (APSS) region as part +of minidump collection. + +How a kernel client driver can register region with minidump +------------------------------------------------------------ + +Client driver can use ``qcom_minidump_region_register`` API's to register +and ``qcom_minidump_region_unregister`` to unregister their region from +minidump driver. + +Client needs to fill their region by filling ``qcom_minidump_region`` +structure object which consists of the region name, region's virtual +and physical address and its size. + +Below, is one sample client driver snippet which tries to allocate a +region from kernel heap of certain size and it writes a certain known +pattern (that can help in verification after collection that we got +the exact pattern, what we wrote) and registers it with minidump. + + .. code-block:: c + + #include <soc/qcom/qcom_minidump.h> + [...] + + + [... inside a function ...] + struct qcom_minidump_region region; + + [...] + + client_mem_region = kzalloc(region_size, GFP_KERNEL); + if (!client_mem_region) + return -ENOMEM; + + [... Just write a pattern ...] + memset(client_mem_region, 0xAB, region_size); + + [... Fill up the region object ...] + strlcpy(region.name, "REGION_A", sizeof(region.name)); + region.virt_addr = client_mem_region; + region.phys_addr = virt_to_phys(client_mem_region); + region.size = region_size; + + ret = qcom_minidump_region_register(®ion); + if (ret < 0) { + pr_err("failed to add region in minidump: err: %d\n", ret); + return ret; + } + + [...] + + +Test +---- + +Existing Qualcomm devices already supports entire RAM dump (also called +full dump) by writing appropriate value to Qualcomm's top control and +status register (tcsr) in ``driver/firmware/qcom_scm.c`` . + +SCM device Tree bindings required to support download mode +For example (sm8450) :: + + / { + + [...] + + firmware { + scm: scm { + compatible = "qcom,scm-sm8450", "qcom,scm"; + [... tcsr register ... ] + qcom,dload-mode = <&tcsr 0x13000>; + + [...] + }; + }; + + [...] + + soc: soc@0 { + + [...] + + tcsr: syscon@1fc0000 { + compatible = "qcom,sm8450-tcsr", "syscon"; + reg = <0x0 0x1fc0000 0x0 0x30000>; + }; + + [...] + }; + [...] + + }; + +User of minidump can pass ``qcom_scm.download_mode="mini"`` to kernel +commandline to set the current download mode to minidump. +Similarly, ``"full"`` is passed to set the download mode to full dump +where entire RAM dump will be collected while setting it ``"full,mini"`` +will collect minidump along with fulldump. + +Writing to sysfs node can also be used to set the mode to minidump:: + + echo "mini" > /sys/module/qcom_scm/parameter/download_mode + +Once the download mode is set, any kind of crash will make the device collect +respective dump as per set download mode. + +Dump collection +--------------- +:: + + +-----------+ + | | + | | +------+ + | | | | + | | +--+---+ Product(Qualcomm SoC) + +-----------+ | + |+++++++++++|<------------+ + |+++++++++++| usb cable + +-----------+ + x86_64 PC + +The solution supports a product running with Qualcomm SoC (where minidump) +is supported from the firmware) connected to x86_64 host PC running PCAT +tool. It supports downloading the minidump produced from product to the +host PC over USB or to save the minidump to the product attached storage +device(UFS/eMMC/SD Card) into minidump dedicated partition. + +By default, dumps are downloaded via USB to the attached x86_64 PC running +PCAT (Qualcomm tool) software. Upon download, we will see a set of binary +blobs starting with name ``md_*`` in PCAT configured directory in x86_64 +machine, so for above example from the client it will be ``md_REGION_A.BIN``. +This binary blob depends on region content to determine whether it needs +external parser support to get the content of the region, so for simple +plain ASCII text we don't need any parsing and the content can be seen +just opening the binary file. + +To collect the dump to attached storage type, one needs to write appropriate +value to IMEM register, in that case dumps are collected in rawdump +partition on the product device itself. + +One needs to read the entire rawdump partition and pull out content to +save it onto the attached x86_64 machine over USB. Later, this rawdump +can be passed to another tool (``dexter.exe`` [Qualcomm tool]) which +converts this into the similar binary blobs which we have got it when +download type was set to USB, i.e. a set of registered regions as blobs +and their name starts with ``md_*``. + +Replacing the ``dexter.exe`` with some open source tool can be added as future +scope of this document.
Add the qualcomm minidump guide for the users which tries to cover the dependency, API use and the way to test and collect minidump on Qualcomm supported SoCs. Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com> --- Documentation/admin-guide/index.rst | 1 + Documentation/admin-guide/qcom_minidump.rst | 272 ++++++++++++++++++++++++++++ 2 files changed, 273 insertions(+) create mode 100644 Documentation/admin-guide/qcom_minidump.rst