[meta-xilinx] Another ZynqMP QSPI linux-xlnx driver issue

Thu Oct 12 10:11:39 PDT 2017

I've found and fixed several issues with the ZynqMP QSPI driver, and added 
support for IO mode to the u-boot-xlnx QSPI driver so that images can be 
loaded from a UBIFS partition before booting the kernel. The end goal was 
to load my kernel, dtb, and bitstream from a rootfs, with the QSPI 
containing multiple root file systems -- enabling an active/inactive 
partition scheme as laid out by mender.io.

After spending roughly a week tracking down what I hope is the final 
issue, I am submitting a patch for this driver. The symptom was 
'garbled'/'corrupted' files in the ubinized rootfs. Here is the example 
that I used to track this down:

I have a file in my rootfs which should be located at 
/etc/mender/mender.conf. The expected file contents are --
root@zcu102-zynqmp:~# cat /etc/mender/mender.conf
{
    "InventoryPollIntervalSeconds": 1800,
    "RetryPollIntervalSeconds": 300,
    "RootfsPartA": "ubi0_0",
    "RootfsPartB": "ubi0_1",
    "ServerCertificate": "/etc/mender/server.crt",
    "ServerURL": "https://docker.mender.io",
    "TenantToken": "dummy",
    "UpdatePollIntervalSeconds": 1800
}

However, after creating my rootfs, flashing it, and mounting it I observe 
the following file contents:
root@zcu102-zynqmp:~# cat /etc/mender/mender.conf
{
    "InventoryPollIntervalSeconds": 1800,

}"RetryPollIntervalSeconds": 300,
    "RootfsPartA": "ubi0_0",
    "RootfsPartB": "ubi0_1",
    "ServerCertificate": "/etc/mender/server.crt",
    "ServerURL": "https://docker.mender.io",
    "TenantToken": "dummy",
    "UpdatePollIntervalSeconds": 1800


As mentioned, it took a while to track this issue down. Some of my process 
below -- 
- I validated the resulting .ubi image file using ubi_reader on the build 
machine
- I built a version of ubiformat with a --confirm option (google to find 
the source) and found that there were miscompares during ubiformat writing
- I used hexdump and grep to find the file contents in the flash. The 
contents in flash looked fine --
root@zcu102-zynqmp:~# hexdump -C /dev/mtd2 -s 0x027ABE00 -n 700
027abe00  00 00 00 00 00 00 00 00  2e 01 00 00 00 00 00 00  |................|
027abe10  7b 0a 20 20 20 20 22 49  6e 76 65 6e 74 6f 72 79  |{.    "Inventory|
027abe20  50 6f 6c 6c 49 6e 74 65  72 76 61 6c 53 65 63 6f  |PollIntervalSeco|
027abe30  6e 64 73 22 3a 20 31 38  30 30 2c 0a 20 20 20 20  |nds": 1800,.    |
027abe40  22 52 65 74 72 79 50 6f  6c 6c 49 6e 74 65 72 76  |"RetryPollInterv|
027abe50  61 6c 53 65 63 6f 6e 64  73 22 3a 20 33 30 30 2c  |alSeconds": 300,|
027abe60  0a 20 20 20 20 22 52 6f  6f 74 66 73 50 61 72 74  |.    "RootfsPart|
027abe70  41 22 3a 20 22 75 62 69  30 5f 30 22 2c 0a 20 20  |A": "ubi0_0",.  |
027abe80  20 20 22 52 6f 6f 74 66  73 50 61 72 74 42 22 3a  |  "RootfsPartB":|
027abe90  20 22 75 62 69 30 5f 31  22 2c 0a 20 20 20 20 22  | "ubi0_1",.    "|
027abea0  53 65 72 76 65 72 43 65  72 74 69 66 69 63 61 74  |ServerCertificat|
027abeb0  65 22 3a 20 22 2f 65 74  63 2f 6d 65 6e 64 65 72  |e": "/etc/mender|
027abec0  2f 73 65 72 76 65 72 2e  63 72 74 22 2c 0a 20 20  |/server.crt",.  |
027abed0  20 20 22 53 65 72 76 65  72 55 52 4c 22 3a 20 22  |  "ServerURL": "|
027abee0  68 74 74 70 73 3a 2f 2f  64 6f 63 6b 65 72 2e 6d  |https://docker.m|
027abef0  65 6e 64 65 72 2e 69 6f  22 2c 0a 20 20 20 20 22  |ender.io",.    "|
027abf00  54 65 6e 61 6e 74 54 6f  6b 65 6e 22 3a 20 22 64  |TenantToken": "d|
027abf10  75 6d 6d 79 22 2c 0a 20  20 20 20 22 55 70 64 61  |ummy",.    "Upda|
027abf20  74 65 50 6f 6c 6c 49 6e  74 65 72 76 61 6c 53 65  |tePollIntervalSe|
027abf30  63 6f 6e 64 73 22 3a 20  31 38 30 30 0a 7d ff ff  |conds": 1800.}..|
027abf40  31 18 10 06 b0 82 9e a9  a4 2f 00 00 00 00 00 00  |1......../......|

- Tried using flash_erase, then flashcp -v -- the flashcp would fail 
readback verification:
root@zcu102-zynqmp:~# flashcp -v rootfs-image-zcu102-zynqmp.ubimg /dev/mtd2
Erasing blocks: 615/615 (100%)
Writing data: 78720k/78720k (100%)
Verifying data: 63490k/78720k (80%)
File does not seem to match flash data. First mismatch at 0x03dfe000-0x03e00800

- I reduced the filesystem size so that I no longer received these flashcp 
verification errors, however I still saw garbled files on the rootfs
- I ran the mtd tests for flash_speed, flash_stress, flash_readtest -- all 
passing
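
For anyone wanting to repeat the readback checks above without a patched ubiformat, the comparison itself is simple. Below is a minimal sketch in C, assuming the image and the data read back from the MTD device are already in memory; first_mismatch is a hypothetical helper, not something taken from mtd-utils.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Compare an image buffer against data read back from flash.
 * Returns the offset of the first differing byte, or len if the
 * two buffers are identical.
 * Hypothetical helper for illustration; not part of mtd-utils.
 */
size_t first_mismatch(const uint8_t *image, const uint8_t *readback,
                      size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (image[i] != readback[i])
            return i;
    }
    return len;
}
```

A return value equal to len means the readback matched; anything smaller is the offset to start hexdumping from.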

I then started tracing through UBIFS, UBI, SPI-NOR, and eventually the 
ZynqMP QSPI driver. Given the issues I've previously found with the Xilinx 
kernel driver (flash_lock/unlock support not properly implemented, 
documentation incorrect with respect to controller behavior, u-boot driver 
lacking IO-Mode support making UBI U-Boot support useless), I was 
expecting this to be another Xilinx issue -- and sure enough, it is.

I set up the kernel driver to dump out various information about what it 
was attempting to do. When trying to cat the /etc/mender/mender.conf file, 
it was trying to read 350 bytes from flash to pass back to the UBI layer. 
A partial dump of this data is shown below:


[ 2179.619765] Byte 0x30 = {
[ 2179.622379] Byte 0x31 =
[ 2179.622379]
[ 2179.626368] Byte 0x32 =
[ 2179.628966] Byte 0x33 =
[ 2179.631578] Byte 0x34 =
[ 2179.634174] Byte 0x35 =
[ 2179.636778] Byte 0x36 = "
[ 2179.639382] Byte 0x37 = I
[ 2179.641995] Byte 0x38 = n
[ 2179.644590] Byte 0x39 = v
[ 2179.647194] Byte 0x3A = e
[ 2179.649798] Byte 0x3B = n
[ 2179.652411] Byte 0x3C = t
[ 2179.655006] Byte 0x3D = o
[ 2179.657610] Byte 0x3E = r
[ 2179.660214] Byte 0x3F = y
[ 2179.662827] Byte 0x40 = P
[ 2179.665423] Byte 0x41 = o
[ 2179.668026] Byte 0x42 = l
[ 2179.670639] Byte 0x43 = l
[ 2179.673235] Byte 0x44 = I
[ 2179.675838] Byte 0x45 = n
[ 2179.678442] Byte 0x46 = t
[ 2179.681055] Byte 0x47 = e
[ 2179.683651] Byte 0x48 = r
[ 2179.686254] Byte 0x49 = v
[ 2179.688858] Byte 0x4A = a
[ 2179.691471] Byte 0x4B = l
[ 2179.694067] Byte 0x4C = S
[ 2179.696671] Byte 0x4D = e
[ 2179.699274] Byte 0x4E = c
[ 2179.701887] Byte 0x4F = o
[ 2179.704483] Byte 0x50 = n
[ 2179.707086] Byte 0x51 = d
[ 2179.709691] Byte 0x52 = s
[ 2179.712303] Byte 0x53 = "
[ 2179.714899] Byte 0x54 = :
[ 2179.717503] Byte 0x55 =
[ 2179.720107] Byte 0x56 = 1
[ 2179.722720] Byte 0x57 = 8
[ 2179.725315] Byte 0x58 = 0
[ 2179.727919] Byte 0x59 = 0
[ 2179.730532] Byte 0x5A = ,
[ 2179.733127] Byte 0x5B =
[ 2179.733127]
[ 2179.737126] Byte 0x5C =
[ 2179.737126]
[ 2179.741128] Byte 0x5D = }
[ 2179.743717] Byte 0x5E =
[ 2179.746321] Byte 0x5F =
[ 2179.748924] Byte 0x60 = "
[ 2179.751538] Byte 0x61 = R
[ 2179.754133] Byte 0x62 = e
[ 2179.756737] Byte 0x63 = t
[ 2179.759341] Byte 0x64 = r
[ 2179.761954] Byte 0x65 = y
[ 2179.764549] Byte 0x66 = P
[ 2179.767153] Byte 0x67 = o
[ 2179.769757] Byte 0x68 = l
[ 2179.772373] Byte 0x69 = l


Based on this data dump, the driver is obviously returning incorrect 
bytes. Specifically, starting at 0x5C these bytes are not what is 
expected. I am operating this driver in IO-Mode (has-io-mode DTS entry) 
and after digging into the sequence written to the generic FIFO everything 
looked like it should be fine. That is, the genfifo is written with a 
control word indicating an exponential read for 256 bytes, then is written 
with a control word indicating a direct read of 94 bytes -- these should 
total 350 bytes. 
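
To make the 256 + 94 split concrete, here is a rough sketch in C of how a transfer length breaks down into exponent entries plus one immediate entry, as described above. This is not the driver code: struct len_entry, MAX_ENTRIES and split_transfer_len are hypothetical names, and the only assumption is that exponent entries cover the bits at position 8 and above while the low 8 bits go into a single immediate entry.

```c
#include <stdint.h>

#define MAX_ENTRIES 32  /* hypothetical bound; 24 exponent bits + 1 immediate */

/* Hypothetical description of one generic-FIFO length entry. */
struct len_entry {
    int is_exponent;  /* 1: length is 2^value bytes, 0: length is value bytes */
    uint8_t value;    /* exponent (>= 8) or immediate byte count (1..255) */
};

/*
 * Split a transfer length into exponent entries (one per set bit at
 * position 8 or above) followed by one immediate entry for the low
 * 8 bits. Returns the number of entries written to out[].
 */
int split_transfer_len(uint32_t len, struct len_entry out[MAX_ENTRIES])
{
    int n = 0;
    uint8_t imm = (uint8_t)(len & 0xFF);
    uint32_t rest = len >> 8;
    uint8_t exponent = 8;  /* 2^8 = 256 bytes */

    while (rest != 0) {
        if (rest & 1) {
            out[n].is_exponent = 1;
            out[n].value = exponent;
            n++;
        }
        rest >>= 1;
        exponent++;
    }
    if (imm != 0) {
        out[n].is_exponent = 0;
        out[n].value = imm;
        n++;
    }
    return n;
}
```

For len = 350 (0x15E) this yields two entries: an exponent entry with value 8 (256 bytes) and an immediate entry with value 0x5E (94 bytes).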

However, based on this data dump, for some reason the controller has 
decided to execute the direct read first, followed by the exponential 
read. This can be observed by noting that there are at least two invalid 
bytes (0x5E, 0x5F) -- these should be the last two bytes of the direct 
read. I've posted previously on the community forums that the controller 
documentation regarding the specification of direct reads is incorrect -- 
a direct read length of 0xFF only reads 255 bytes (the documentation 
incorrectly claims this will result in a 256 byte read). In this case the 
direct read value was 0x5E = 94 bytes, so I would expect 94 valid bytes. 
But the controller bus is 32 bits wide and 94 is not a multiple of 4 
(94/4 = 23.5), so the read occupies 24 words (96 bytes) and at least two 
bytes will be invalid. I think this would not be an issue if the 
controller had actually executed the exponential read first, but for some 
reason that did not happen (I still have no idea why... hardware bug? It 
would seem like a large HW bug).

After staring at the driver code for a while I came up with the following 
patch --

From 02b4e5833941b8f30f97b73f0461c0825fdb3ad6 Mon Sep 17 00:00:00 2001
From: Holden Sandlar <holden.sandlar at ultra-fei.com>
Date: Thu, 12 Oct 2017 11:55:15 -0400
Subject: [PATCH] Fixed Xilinx QSPI driver for case in which exponent mode is
 used in conjunction with immediate mode (implies some number of bytes not
 divisible by 256), AND the intermediate number of bytes is not evenly
 divisible by 4. In this case, depending on what order the controller decides
 to execute transactions there is potential for (imm_data%4) bytes of
 corruption. This patch *hopefully* resolves this XILINX ISSUE

---
 drivers/spi/spi-zynqmp-gqspi.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/spi/spi-zynqmp-gqspi.c b/drivers/spi/spi-zynqmp-gqspi.c
index 144acd0..5c1495e 100644
--- a/drivers/spi/spi-zynqmp-gqspi.c
+++ b/drivers/spi/spi-zynqmp-gqspi.c
@@ -988,6 +988,13 @@ static int zynqmp_qspi_start_transfer(struct spi_master *master,
                if (imm_data != 0) {
                        genfifoentry &= ~GQSPI_GENFIFO_EXP;
                        genfifoentry &= ~GQSPI_GENFIFO_IMM_DATA_MASK;
+                       if (imm_data % 4 != 0)
+                       {
+                               if(((imm_data + 4 - (imm_data % 4)) & 0xFF) == 0x00)
+                                       imm_data = 0xFF;
+                               else
+                                       imm_data = imm_data + 4 - (imm_data % 4);
+                       }
                        genfifoentry |= (u8) (imm_data & 0xFF);
                        zynqmp_gqspi_write(xqspi,
                                           GQSPI_GEN_FIFO_OFST, genfifoentry);
--
1.8.3.1

This patch essentially rounds the number of bytes requested in a direct 
read up to the next multiple of four, and accounts for the case where 
there are 0xFD-0xFF remainder bytes in this direct read: rounding those up 
would give 256, which does not fit in the 8-bit immediate field, so the 
length is set to 0xFF instead. I think this change still does not properly 
handle a case where there are 256 remainder bytes, but based on the way 
this driver is written that case should never occur as it would be 
executed in exponent mode. After making this change, I no longer get 
miscompares during ubiformat --confirm and I actually get the correct data 
when I cat files on the resulting booted rootfs. I want to also note that 
this mender.conf file was NOT the only file showing this type of issue on 
the rootfs without this patch -- honestly it's lucky that the rootfs 
booted at all with this kernel driver issue present; there were other 
files in portions of the udev system showing the same problem, I just 
focused on the mender.conf file for my debug process.
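
For reference, the rounding done by the patch can be pulled out into a standalone function and checked in userspace. The sketch below mirrors the patch logic in C; pad_imm_data is a hypothetical name and the function is not part of the driver.

```c
#include <stdint.h>

/*
 * Round an immediate length up to the next multiple of four. If
 * rounding up would reach 256, which no longer fits in the 8-bit
 * immediate field, clamp to 0xFF instead.
 * Mirrors the logic of the patch; the function name is hypothetical.
 */
uint32_t pad_imm_data(uint32_t imm_data)
{
    if (imm_data % 4 != 0) {
        uint32_t rounded = imm_data + 4 - (imm_data % 4);

        if ((rounded & 0xFF) == 0x00)
            imm_data = 0xFF;
        else
            imm_data = rounded;
    }
    return imm_data;
}
```

With this, the 94-byte direct read from the mender.conf case becomes 96, lengths that are already a multiple of four are left alone, and 0xFD through 0xFF all end up as 0xFF.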

I'm not 100% sure this is really the "best" way to fix this driver, but 
I'm glad to be rid of this issue and moving on to other things and hope I 
won't be running into any more week-long Xilinx issue debug...

Regards,
Holden

