Fri 04.01.2019

Mesa assembly bug on Raspberry Pi / VC4 with fix

After a few days of banging my head against this, chasing a red herring in gdb, and with some help from HdkR, I found the problem. Mesa's VC4 code that uploads (and downloads) texture tiles contains incorrect inline ARM assembly. It forgot that some of the operands held in registers are actually modified by the assembly, and declared them as "r" instead of "+r". It seemed a simple enough fix, but unfortunately there was a series of issues that made it more complex.

So first, some "r" arguments had to become "+r" ("+r" means the register argument is modified by the assembly, as opposed to "r", which means it is input-only and not modified). This can't be done as a small, isolated change, because the compiler requires "+r" arguments to be moved into the OUTPUT section (the top line of arguments starting with ":", which was empty before), which means all the positional argument numbers shift around. Manually renumbering everything, converting %0 to %2 and %2 to %1 and so on, is just asking for trouble and a pretty unsafe way to do it. The better way is to NAME the arguments. This is done by adding a [name_here] in front of each argument in the output/input lists at the end; instead of referring to an argument positionally like %0 or %2, you can now refer to it by name like %[name_here]. This was the first thing I did, so as to not make mistakes later. Even if this were all I did, I'd still have crashing, non-working code that can't be tested.
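To make the shape of that change concrete, here is a minimal sketch of the conversion (a made-up helper, not the Mesa code itself, just the same pattern on a single post-indexed store): the pointer is modified by the instruction, so it has to move into the output list as "+r", and naming the operands is what makes that move safe.

#if defined(__arm__) || defined(__aarch64__)
/* Broken pattern (positional, the modified pointer wrongly declared input-only):
 *
 *     __asm__ volatile ("str %1, [%0], #4" : : "r"(p), "r"(val));
 *
 * Fixed pattern (named operands, modified pointer in the output list):
 */
static inline void
store_and_advance(uint32_t **dst, uint32_t val)
{
        void *p = *dst;

        __asm__ volatile (
                "str %[val], [%[p]], #4\n"   /* post-index writes back into p */
                : [p]   "+r"(p)              /* modified by the asm -> "+r", output section */
                : [val]  "r"(val)            /* read-only input */
                : "memory");
        *dst = p;
}
#endif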

This unfortunately leads to another problem. One of the original input arguments was the value cpu + 8. The compiler does not allow this once you name arguments: it has to be a plain variable, not an expression. So I quickly tried using a temporary C variable above the assembly block, like:

unsigned char *cpu2 = cpu + 8;

(A side note: the incoming cpu pointer is a void *, so doing arithmetic on it like the original code did is not a great idea, as the element size is not clearly defined.)

This temporary C variable then led to the compiler somehow filling cpu2 with junk inside the assembly rather than the address of cpu + 8. Chalk this one up to a compiler bug, or to the compiler feeling free to do whatever it wants when you do void pointer arithmetic. So there was no way to have a working patch in this form, because it would just crash as the second address (cpu2) is wrong. Thus I also had to roll the add into the assembly block by hand, using a temporary register of r0 (or w0 on aarch64), to ensure it was correct and actually worked (in theory); but even then, if I hadn't also changed the "r" to "+r", it would still crash for the same original reasons described at the top.
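For reference, here is a distilled sketch of that workaround (a made-up minimal example assuming 32-bit ARM with NEON, not the actual Mesa function): the second pointer is computed inside the assembly with a scratch register (r0), r0 goes into the clobber list, and cpu is declared "+r" because the post-indexed stores write it back.

#ifdef __ARM_NEON
static inline void
copy_two_16byte_lines(void *cpu, uint32_t cpu_stride, const void *gpu)
{
        __asm__ volatile (
                "add r0, %[cpu], #8\n"                   /* cpu + 8, computed in asm */
                "vldm %[gpu], {q0, q1}\n"                /* two 16-byte lines from gpu */
                "vst1.8 d0, [%[cpu]], %[cpu_stride]\n"   /* line 0, first 8 bytes */
                "vst1.8 d1, [r0],     %[cpu_stride]\n"   /* line 0, second 8 bytes */
                "vst1.8 d2, [%[cpu]]\n"                  /* line 1, first 8 bytes */
                "vst1.8 d3, [r0]\n"                      /* line 1, second 8 bytes */
                : [cpu]        "+r"(cpu)                 /* post-incremented -> "+r" */
                : [gpu]         "r"(gpu),
                  [cpu_stride]  "r"(cpu_stride)
                : "r0", "q0", "q1", "memory");           /* hand-used scratch r0 is clobbered */
}
#endif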

So this really necessitates a single patch, both for the safety/sanity of the change and so that it actually compiles and runs as a unit. If I did just the name change as a stand-alone patch, it couldn't compile thanks to the cpu + 8, and thus couldn't pass test suites or be useful. If I just added a line of C outside the block and converted cpu + 8 to cpu2 as above, it would generate incorrect code (the values I saw in gdb for that register were total junk, nothing like the cpu address); that would be a bit more change in one patch, but it still wouldn't run or pass any test suites. I could add the inline-assembly add as I did and make that alone a single patch, but at least for me this would still crash, so it would be an untested and untestable patch. Thus it really needed to also merge in the change of "r" to "+r" in some cases, to be a compilable and testable unit on its own.

Sometimes life gives you lemons, so you have to make lemonade.

What to learn?

What can we learn from this?

Firstly, when writing inline assembly, it's always easier to read, maintain and modify if you use named rather than positional arguments. If you have to move arguments around (say, from input to output), you don't have to do a risky renumbering of all the positional references. It'll make your days saner when reading the code anyway.

Secondly, be careful about how you tag your arguments. Are they modified in the assembly block? Read the manuals for your instructions carefully to be sure. The bug may appear years later, long after you or someone else wrote that code, when a more aggressively optimizing compiler breaks it. Read the docs on inline assembly and these notations and get them right. Also, don't forget to put any registers you use by hand inside the clobber list (the last set of values at the end; this was already correct here, but if you use any new registers, add them to that list, and if you stop using them, remove them).

Also, just as a note: don't do arithmetic on pointers whose element size is not clearly defined, so don't do void *ptr1 = x; void *ptr2 = ptr1 + 8;. Use at least a char * if you're doing your address math byte by byte.
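As a small illustration (get_buffer() here is just a hypothetical stand-in for wherever the untyped pointer comes from):

void *cpu = get_buffer();                        /* hypothetical: some untyped buffer */

void          *bad  = cpu + 8;                   /* GCC extension only; void * has no element size */
unsigned char *good = (unsigned char *)cpu + 8;  /* standard C, clearly byte-by-byte */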

The patch

If anyone is having problems with a freshly compiled Mesa built with a recent compiler, you might want to try the patch below to fix it. I've sent this off to Eric Anholt, so it'll hopefully go into Mesa some time soon.

From fa297372cec1f964f7946d7de55c28920052ab8b Mon Sep 17 00:00:00 2001
From: Carsten Haitzler
Date: Wed, 2 Jan 2019 18:56:23 +0000
Subject: [PATCH 10/10] vc4: Fix NEON and aarch64 ASM for up/download of textures

The neon ASM didn't properly declare variables that go into
registers that are being modified in the ASM thus confusing
the compiler and generating bad code and segfaults on
texture upload at least. This fixes that and properly
declares registers with a "+" for input and output.

It also has to move these to another section (they must be
in the output section, rather than input). It also uses a
more readable naming of the variables that go into registers
so the code is easier to follow and maintain (had to do
this just as part of fixing it to make it sane to debug
and modify/fix).

This seems to have happened as a result of a new compiler
(gcc 8.2) at least and generates the bad code because
the inline functions are being inlined in a way where
registers that contain pointers are being re-used
since they were not declared to have changed. I noticed
that compiling mesa with -O0 made the problem go away
(turned off inlining thus the above issue). -O1 and
above and it happened.

I only tested this on 32bit ARM. The same problem existed
in theory for aarch64 but I didn't see the issue there. I
fixed the aarch64 ASM too.
---
 src/broadcom/common/v3d_cpu_tiling.h | 200 +++++++++++++++------------
 1 file changed, 108 insertions(+), 92 deletions(-)

diff --git a/src/broadcom/common/v3d_cpu_tiling.h b/src/broadcom/common/v3d_cpu_tiling.h
index 82ac2aef6a..ac3d0135b8 100644
--- a/src/broadcom/common/v3d_cpu_tiling.h
+++ b/src/broadcom/common/v3d_cpu_tiling.h
@@ -37,20 +37,22 @@ v3d_load_utile(void *cpu, uint32_t cpu_stride,
                         /* Load from the GPU in one shot, no interleave, to
                          * d0-d7.
                          */
-                        "vldm %0, {q0, q1, q2, q3}\n"
+                        ".fpu neon\n"
+                        "vldm %[gpu], {q0, q1, q2, q3}\n"
                         /* Store each 8-byte line to cpu-side destination,
                          * incrementing it by the stride each time.
                          */
-                        "vst1.8 d0, [%1], %2\n"
-                        "vst1.8 d1, [%1], %2\n"
-                        "vst1.8 d2, [%1], %2\n"
-                        "vst1.8 d3, [%1], %2\n"
-                        "vst1.8 d4, [%1], %2\n"
-                        "vst1.8 d5, [%1], %2\n"
-                        "vst1.8 d6, [%1], %2\n"
-                        "vst1.8 d7, [%1]\n"
-                        :
-                        : "r"(gpu), "r"(cpu), "r"(cpu_stride)
+                        "vst1.8 d0, [%[cpu]], %[cpu_stride]\n"
+                        "vst1.8 d1, [%[cpu]], %[cpu_stride]\n"
+                        "vst1.8 d2, [%[cpu]], %[cpu_stride]\n"
+                        "vst1.8 d3, [%[cpu]], %[cpu_stride]\n"
+                        "vst1.8 d4, [%[cpu]], %[cpu_stride]\n"
+                        "vst1.8 d5, [%[cpu]], %[cpu_stride]\n"
+                        "vst1.8 d6, [%[cpu]], %[cpu_stride]\n"
+                        "vst1.8 d7, [%[cpu]]\n"
+                        : [cpu]        "+r"(cpu)
+                        : [gpu]         "r"(gpu),
+                          [cpu_stride]  "r"(cpu_stride)
                         : "q0", "q1", "q2", "q3");
                 return;
         } else if (gpu_stride == 16) {
@@ -58,22 +60,25 @@ v3d_load_utile(void *cpu, uint32_t cpu_stride,
                         /* Load from the GPU in one shot, no interleave, to
                          * d0-d7.
                          */
-                        "vldm %0, {q0, q1, q2, q3};\n"
+                        ".fpu neon\n"
+                        "add r0, %[cpu], #8\n" /* Work around compiler bug */
+                        "vldm %[gpu], {q0, q1, q2, q3};\n"
                         /* Store each 16-byte line in 2 parts to the cpu-side
                          * destination.  (vld1 can only store one d-register
                          * at a time).
                          */
-                        "vst1.8 d0, [%1], %3\n"
-                        "vst1.8 d1, [%2], %3\n"
-                        "vst1.8 d2, [%1], %3\n"
-                        "vst1.8 d3, [%2], %3\n"
-                        "vst1.8 d4, [%1], %3\n"
-                        "vst1.8 d5, [%2], %3\n"
-                        "vst1.8 d6, [%1]\n"
-                        "vst1.8 d7, [%2]\n"
-                        :
-                        : "r"(gpu), "r"(cpu), "r"(cpu + 8), "r"(cpu_stride)
-                        : "q0", "q1", "q2", "q3");
+                        "vst1.8 d0, [%[cpu]], %[cpu_stride]\n"
+                        "vst1.8 d1, [r0],     %[cpu_stride]\n"
+                        "vst1.8 d2, [%[cpu]], %[cpu_stride]\n"
+                        "vst1.8 d3, [r0],     %[cpu_stride]\n"
+                        "vst1.8 d4, [%[cpu]], %[cpu_stride]\n"
+                        "vst1.8 d5, [r0],     %[cpu_stride]\n"
+                        "vst1.8 d6, [%[cpu]]\n"
+                        "vst1.8 d7, [r0]\n"
+                        : [cpu]        "+r"(cpu)
+                        : [gpu]         "r"(gpu),
+                          [cpu_stride]  "r"(cpu_stride)
+                        : "r0", "q0", "q1", "q2", "q3");
                 return;
         }
 #elif defined (PIPE_ARCH_AARCH64)
@@ -82,20 +87,21 @@ v3d_load_utile(void *cpu, uint32_t cpu_stride,
                         /* Load from the GPU in one shot, no interleave, to
                          * d0-d7.
                          */
-                        "ld1 {v0.2d, v1.2d, v2.2d, v3.2d}, [%0]\n"
+                        "ld1 {v0.2d, v1.2d, v2.2d, v3.2d}, [%[gpu]]\n"
                         /* Store each 8-byte line to cpu-side destination,
                          * incrementing it by the stride each time.
                          */
-                        "st1 {v0.D}[0], [%1], %2\n"
-                        "st1 {v0.D}[1], [%1], %2\n"
-                        "st1 {v1.D}[0], [%1], %2\n"
-                        "st1 {v1.D}[1], [%1], %2\n"
-                        "st1 {v2.D}[0], [%1], %2\n"
-                        "st1 {v2.D}[1], [%1], %2\n"
-                        "st1 {v3.D}[0], [%1], %2\n"
-                        "st1 {v3.D}[1], [%1]\n"
-                        :
-                        : "r"(gpu), "r"(cpu), "r"(cpu_stride)
+                        "st1 {v0.D}[0], [%[cpu]], %[cpu_stride]\n"
+                        "st1 {v0.D}[1], [%[cpu]], %[cpu_stride]\n"
+                        "st1 {v1.D}[0], [%[cpu]], %[cpu_stride]\n"
+                        "st1 {v1.D}[1], [%[cpu]], %[cpu_stride]\n"
+                        "st1 {v2.D}[0], [%[cpu]], %[cpu_stride]\n"
+                        "st1 {v2.D}[1], [%[cpu]], %[cpu_stride]\n"
+                        "st1 {v3.D}[0], [%[cpu]], %[cpu_stride]\n"
+                        "st1 {v3.D}[1], [%[cpu]]\n"
+                        : [cpu]        "+r"(cpu)
+                        : [gpu]         "r"(gpu),
+                          [cpu_stride]  "r"(cpu_stride)
                         : "v0", "v1", "v2", "v3");
                 return;
         } else if (gpu_stride == 16) {
@@ -103,22 +109,24 @@ v3d_load_utile(void *cpu, uint32_t cpu_stride,
                         /* Load from the GPU in one shot, no interleave, to
                          * d0-d7.
                          */
-                        "ld1 {v0.2d, v1.2d, v2.2d, v3.2d}, [%0]\n"
+                        "add w0, %[cpu], #8\n" /* Work around compiler bug */
+                        "ld1 {v0.2d, v1.2d, v2.2d, v3.2d}, [%[gpu]]\n"
                         /* Store each 16-byte line in 2 parts to the cpu-side
                          * destination.  (vld1 can only store one d-register
                          * at a time).
                          */
-                        "st1 {v0.D}[0], [%1], %3\n"
-                        "st1 {v0.D}[1], [%2], %3\n"
-                        "st1 {v1.D}[0], [%1], %3\n"
-                        "st1 {v1.D}[1], [%2], %3\n"
-                        "st1 {v2.D}[0], [%1], %3\n"
-                        "st1 {v2.D}[1], [%2], %3\n"
-                        "st1 {v3.D}[0], [%1]\n"
-                        "st1 {v3.D}[1], [%2]\n"
-                        :
-                        : "r"(gpu), "r"(cpu), "r"(cpu + 8), "r"(cpu_stride)
-                        : "v0", "v1", "v2", "v3");
+                        "st1 {v0.D}[0], [%[cpu]], %[cpu_stride]\n"
+                        "st1 {v0.D}[1], [w0],     %[cpu_stride]\n"
+                        "st1 {v1.D}[0], [%[cpu]], %[cpu_stride]\n"
+                        "st1 {v1.D}[1], [w0],     %[cpu_stride]\n"
+                        "st1 {v2.D}[0], [%[cpu]], %[cpu_stride]\n"
+                        "st1 {v2.D}[1], [w0],     %[cpu_stride]\n"
+                        "st1 {v3.D}[0], [%[cpu]]\n"
+                        "st1 {v3.D}[1], [w0]\n"
+                        : [cpu]        "+r"(cpu)
+                        : [gpu]         "r"(gpu),
+                          [cpu_stride]  "r"(cpu_stride)
+                        : "w0", "v0", "v1", "v2", "v3");
                 return;
         }
 #endif
@@ -139,20 +147,22 @@ v3d_store_utile(void *gpu, uint32_t gpu_stride,
                         /* Load each 8-byte line from cpu-side source,
                          * incrementing it by the stride each time.
                          */
-                        "vld1.8 d0, [%1], %2\n"
-                        "vld1.8 d1, [%1], %2\n"
-                        "vld1.8 d2, [%1], %2\n"
-                        "vld1.8 d3, [%1], %2\n"
-                        "vld1.8 d4, [%1], %2\n"
-                        "vld1.8 d5, [%1], %2\n"
-                        "vld1.8 d6, [%1], %2\n"
-                        "vld1.8 d7, [%1]\n"
+                        ".fpu neon\n"
+                        "vld1.8 d0, [%[cpu]], %[cpu_stride]\n"
+                        "vld1.8 d1, [%[cpu]], %[cpu_stride]\n"
+                        "vld1.8 d2, [%[cpu]], %[cpu_stride]\n"
+                        "vld1.8 d3, [%[cpu]], %[cpu_stride]\n"
+                        "vld1.8 d4, [%[cpu]], %[cpu_stride]\n"
+                        "vld1.8 d5, [%[cpu]], %[cpu_stride]\n"
+                        "vld1.8 d6, [%[cpu]], %[cpu_stride]\n"
+                        "vld1.8 d7, [%[cpu]]\n"
                         /* Load from the GPU in one shot, no interleave, to
                          * d0-d7.
                          */
-                        "vstm %0, {q0, q1, q2, q3}\n"
-                        :
-                        : "r"(gpu), "r"(cpu), "r"(cpu_stride)
+                        "vstm %[gpu], {q0, q1, q2, q3}\n"
+                        : [cpu]        "+r"(cpu)
+                        : [gpu]         "r"(gpu),
+                          [cpu_stride]  "r"(cpu_stride)
                         : "q0", "q1", "q2", "q3");
                 return;
         } else if (gpu_stride == 16) {
@@ -161,19 +171,22 @@ v3d_store_utile(void *gpu, uint32_t gpu_stride,
                          * destination.  (vld1 can only store one d-register
                          * at a time).
                          */
-                        "vld1.8 d0, [%1], %3\n"
-                        "vld1.8 d1, [%2], %3\n"
-                        "vld1.8 d2, [%1], %3\n"
-                        "vld1.8 d3, [%2], %3\n"
-                        "vld1.8 d4, [%1], %3\n"
-                        "vld1.8 d5, [%2], %3\n"
-                        "vld1.8 d6, [%1]\n"
-                        "vld1.8 d7, [%2]\n"
+                        ".fpu neon\n"
+                        "add r0, %[cpu], #8\n" /* Work around compiler bug */
+                        "vld1.8 d0, [%[cpu]], %[cpu_stride]\n"
+                        "vld1.8 d1, [r0],     %[cpu_stride]\n"
+                        "vld1.8 d2, [%[cpu]], %[cpu_stride]\n"
+                        "vld1.8 d3, [r0],     %[cpu_stride]\n"
+                        "vld1.8 d4, [%[cpu]], %[cpu_stride]\n"
+                        "vld1.8 d5, [r0],     %[cpu_stride]\n"
+                        "vld1.8 d6, [%[cpu]]\n"
+                        "vld1.8 d7, [r0]\n"
                         /* Store to the GPU in one shot, no interleave. */
-                        "vstm %0, {q0, q1, q2, q3}\n"
-                        :
-                        : "r"(gpu), "r"(cpu), "r"(cpu + 8), "r"(cpu_stride)
-                        : "q0", "q1", "q2", "q3");
+                        "vstm %[gpu], {q0, q1, q2, q3}\n"
+                        : [cpu]        "+r"(cpu)
+                        : [gpu]         "r"(gpu),
+                          [cpu_stride]  "r"(cpu_stride)
+                        : "r0", "q0", "q1", "q2", "q3");
                 return;
         }
 #elif defined (PIPE_ARCH_AARCH64)
@@ -182,18 +195,19 @@ v3d_store_utile(void *gpu, uint32_t gpu_stride,
                         /* Load each 8-byte line from cpu-side source,
                          * incrementing it by the stride each time.
                          */
-                        "ld1 {v0.D}[0], [%1], %2\n"
-                        "ld1 {v0.D}[1], [%1], %2\n"
-                        "ld1 {v1.D}[0], [%1], %2\n"
-                        "ld1 {v1.D}[1], [%1], %2\n"
-                        "ld1 {v2.D}[0], [%1], %2\n"
-                        "ld1 {v2.D}[1], [%1], %2\n"
-                        "ld1 {v3.D}[0], [%1], %2\n"
-                        "ld1 {v3.D}[1], [%1]\n"
+                        "ld1 {v0.D}[0], [%[cpu]], %[cpu_stride]\n"
+                        "ld1 {v0.D}[1], [%[cpu]], %[cpu_stride]\n"
+                        "ld1 {v1.D}[0], [%[cpu]], %[cpu_stride]\n"
+                        "ld1 {v1.D}[1], [%[cpu]], %[cpu_stride]\n"
+                        "ld1 {v2.D}[0], [%[cpu]], %[cpu_stride]\n"
+                        "ld1 {v2.D}[1], [%[cpu]], %[cpu_stride]\n"
+                        "ld1 {v3.D}[0], [%[cpu]], %[cpu_stride]\n"
+                        "ld1 {v3.D}[1], [%[cpu]]\n"
                         /* Store to the GPU in one shot, no interleave. */
-                        "st1 {v0.2d, v1.2d, v2.2d, v3.2d}, [%0]\n"
-                        :
-                        : "r"(gpu), "r"(cpu), "r"(cpu_stride)
+                        "st1 {v0.2d, v1.2d, v2.2d, v3.2d}, [%[gpu]]\n"
+                        : [cpu]        "+r"(cpu)
+                        : [gpu]         "r"(gpu),
+                          [cpu_stride]  "r"(cpu_stride)
                         : "v0", "v1", "v2", "v3");
                 return;
         } else if (gpu_stride == 16) {
@@ -202,19 +216,21 @@ v3d_store_utile(void *gpu, uint32_t gpu_stride,
                          * destination.  (vld1 can only store one d-register
                          * at a time).
                          */
-                        "ld1 {v0.D}[0], [%1], %3\n"
-                        "ld1 {v0.D}[1], [%2], %3\n"
-                        "ld1 {v1.D}[0], [%1], %3\n"
-                        "ld1 {v1.D}[1], [%2], %3\n"
-                        "ld1 {v2.D}[0], [%1], %3\n"
-                        "ld1 {v2.D}[1], [%2], %3\n"
-                        "ld1 {v3.D}[0], [%1]\n"
-                        "ld1 {v3.D}[1], [%2]\n"
+                        "add w0, %[cpu], #8\n" /* Work around compiler bug */
+                        "ld1 {v0.D}[0], [%[cpu]], %[cpu_stride]\n"
+                        "ld1 {v0.D}[1], [w0],     %[cpu_stride]\n"
+                        "ld1 {v1.D}[0], [%[cpu]], %[cpu_stride]\n"
+                        "ld1 {v1.D}[1], [w0],     %[cpu_stride]\n"
+                        "ld1 {v2.D}[0], [%[cpu]], %[cpu_stride]\n"
+                        "ld1 {v2.D}[1], [w0],     %[cpu_stride]\n"
+                        "ld1 {v3.D}[0], [%[cpu]]\n"
+                        "ld1 {v3.D}[1], [w0]\n"
                         /* Store to the GPU in one shot, no interleave. */
-                        "st1 {v0.2d, v1.2d, v2.2d, v3.2d}, [%0]\n"
-                        :
-                        : "r"(gpu), "r"(cpu), "r"(cpu + 8), "r"(cpu_stride)
-                        : "v0", "v1", "v2", "v3");
+                        "st1 {v0.2d, v1.2d, v2.2d, v3.2d}, [%[gpu]]\n"
+                        : [cpu]        "+r"(cpu)
+                        : [gpu]         "r"(gpu),
+                          [cpu_stride]  "r"(cpu_stride)
+                        : "w0", "v0", "v1", "v2", "v3");
                 return;
         }
 #endif
--
2.20.1
