TROUBLESHOOTING
In July of 2008, we were called in by a client who had a very expensive custom-built computer sitting around that couldn’t be used because there was a problem. They called FXO to check it out. (Full disclosure: I worked on this computer as a direct employee of this company many years ago.)
Knowns: The computer passes the automated self-test program. This is an extensive program that tests all of the instructions, the memory, and the I/O interfaces of the computer. However, when the client installed this computer in the system (a spacecraft simulator), they were getting indications of an instruction error.
Background:
The Automated Test Program tests all of the instructions one at a time. For example, when testing the multiply instruction, mostly multiplies are executed and the results checked. If the divide instruction is being tested, then mostly divide instructions are executed (of course, the memory fetch and store, compare, and branch instructions needed to check the results and take action are executed as well).
When the computer is running in the spacecraft simulator, it is performing the normal flight software commands using all of the instructions. A small piece of code runs every 2 seconds to check out the instruction set of the computer. That is, while the computer is running flight software, a short time is taken away to do a self check by executing a small set of instructions with known data and comparing the result at the end to the expected result. This is similar to a memory scrubbing routine that would run periodically to read and write back memory locations to keep the data and error correction check bits fresh.
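As a rough illustration of the periodic self check described above, here is a minimal Python sketch. It is not the flight code: the operand values, the 16-bit mask, and the names `run_sequence`, `good_multiply`, and `self_check` are all assumptions made for the example. The idea is only that a fixed chain of operations on known data is compared against a stored "golden" answer.

```python
MASK16 = 0xFFFF  # assumed word size for this sketch only


def run_sequence(multiply):
    """Chain a few operations on known data.

    `multiply` is passed in so a faulty multiplier can be simulated.
    The constants are made up for illustration.
    """
    acc = 1234
    acc = (acc + 4321) & MASK16
    acc = multiply(acc, 3) & MASK16
    acc ^= 0x5555
    return acc


def good_multiply(a, b):
    """Reference multiplier that always gives the right product."""
    return a * b


# "Golden" answer, computed once with the known-good operations.
EXPECTED = run_sequence(good_multiply)


def self_check(multiply):
    """Return True when the chained result matches the stored answer."""
    return run_sequence(multiply) == EXPECTED
```

Because every step feeds the next, a wrong result from any single operation changes the final answer, which is why one comparison at the end is enough to flag an error.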
Problem:
While the computer was running in the simulator, it would very frequently go “NOT OK” and go off-line with an error in the instruction test part of the code.
8201 A88D 828E 141( ) 39 DOSLIT LXX BIG$ LOAD INDEX REGISTER X
8202 B082 8284 130( ) 40 LXY ONE$ LOAD INDEX REGISTER Y
8203 B882 8285 130( ) 41 LXZ NEGI LOAD INDEX REGISTER Z
8204 6064 8268 100( ) 42 LDA ACONST STARTING SHIFT CONSTANT
8205 C403 43 SLL 4 LOGICAL SHIFT ADDS 4 RIGHTMOST ZEROES
8206 C600 44 RAL 1 NOW ROTATE LEFTMOST 1 BIT TO RIGHT
8207 7063 826A 99( ) 45 LMQ QCONST LOAD Q FOR LLS
8208 C084 46 LLS 5
8209 7C79 8282 121( ) 47 SMQ TEMPHOL$,I RETAIN Q PORTION OF RESULT
820A 0C78 8282 120( ) 48 ADM TEMPHOL$,I RETAIN A PORTION OF RESULT
820B C305 49 SRA 6 ALGEBRAIC RIGHT RETAINS SIGN BIT
820C C502 50 SRL 3 LOGICAL RIGHT ZEROES 3 LEFT BITS
820D C701 51 RAR 2 ROTATE 2 BITS FROM LOWER RIGHT TO UPPER LEFT
820E 5064 8272 100( ) 52 AND MINUSAND NOW START LOGICAL TEST
820F 405D 826C 93( ) 53 IOR IORCONST CONTINUE LOGICAL TEST
8210 4C72 8282 114( ) 54 XOR TEMPHOL$,I COMBINE LOGICAL AND SHIFT RESULTS
8211 185F 8270 95( ) 55 ADD ADDCONST START ARITHMETIC TEST
8212 105C 826E 92( ) 56 SUB SUBCONST CONTINUE ARITHMETIC TEST
8213 2861 8274 97( ) 57 MPC THREE$IT CONTINUE ARITHMETIC TEST
8214 6C6F 8283 111( ) 58 STA ANSWER$,I HOLD THE RESULT
8215 C51E 59 SRL 31 INSURE LOWER DIVIDEND FOR DIVIDE TEST
8216 3060 8276 96( ) 60 DVD TWO$IT TEST DIVIDE
8217 1C6C 8283 108( ) 61 ADD ANSWER$,I ADD IN PREVIOUS RESULT
8218 7C6B 8283 107( ) 62 SMQ ANSWER$,I RETAIN DIVIDE REMAINDER
8219 0C6A 8283 106( ) 63 ADM ANSWER$,I RETAIN DIVIDE QUOTIENT ALSO
821A 2058 8272 88( ) 64 MPS MINUSAND MULTIPLY SINGLE TEST
821B 0C68 8283 104( ) 65 ADM ANSWER$,I COMBINED IN FINAL RESULT OF SLIT
821C C801 70 XFR N,A LOAD A ZERO
821D 8062 827F 98( ) 71 BOP ERROR$ 0 IS NOT + ON THIS MACHINE
821E 8861 827F 97( ) 72 BON ERROR$ 0 IS NOT - ON ANY MACHINE
821F 905F 827E 95( ) 73 BOZ TRYPLUS$ THIS IS THE RIGHT JUMP
8220 D02B 824C 44( ) 74 BUI ERROR JUST IN CASE
8221 6200 75 TRYPLUS LDA 0(Y) LOAD 1 VIA Y INDEX
8222 905D 827F 93( ) 76 BOZ ERROR$ WRONG
8223 885C 827F 92( ) 77 BON ERROR$ WRONG
8224 8059 827D 89( ) 78 BOP TRYNEG$ OK
8225 D026 824C 39( ) 79 BUI ERROR JUST IN CASE
8226 6500 80 TRYNEG LDA 0(Z),I LOAD - VALUE VIA Z INDEX INDIRECT
8227 9058 827F 88( ) 81 BOZ ERROR$ NO
8228 8057 827F 87( ) 82 BOP ERROR$ NO
8229 8853 827C 83( ) 83 BON TRYOVFL$ YES
822A D021 824C 34( ) 84 BUI ERROR
822B D85E 8289 94( ) 85 TRYOVFLO RST BIT13ZRO ERASE OVERFLOW BIT
822C 6060 828C 96( ) 86 LDA ONE$IT 1
822D 185F 828C 95( ) 87 ADD ONE$IT
822E F851 827F 81( ) 88 BOV ERROR$ SHOULD NOT OVERFLOW
822F 6300 89 LDA 0(X) LOAD BIG VALUE VIA X INDEX
8230 1B00 90 ADD 0(X) ADD BIG VALUE VIA X INDEX TO CAUSE OVERFLOW
8231 F84A 827B 74( ) 91 BOV TRYDMB$ SHOULD OVERFLOW HERE
8232 D019 824C 26( ) 92 BUI ERROR DIDNT OVERFLOW
8233 6043 8276 67( ) 93 TRYDMB LDA TWO$IT UPPER HALF ONLY
8234 6C4E 8282 78( ) 94 STA TEMPHOL$,I
8235 9C4D 8282 77( ) 95 DMB TEMPHOL$,I
8236 9C4C 8282 76( ) 96 DMB TEMPHOL$,I
8237 D014 824C 21( ) 97 BUI ERROR SHOULD NOT BE HERE
8238 A853 828B 83( ) 98 LXX MINUS1 NOW TRY MXS TEST
8239 D301 99 MXS +1(X) SHOULD CAUSE 0 SKIP
823A D011 824C 18( ) 100 BUI ERROR SHOULD NOT BE HERE
823B B03B 8276 59( ) 101 LXY TWO$IT NOW TRY Y REG
823C D6FE 102 MXS -2(Y) SHOULD SKIP AGAIN
823D D00E 824C 15( ) 103 BUI ERROR NO
823E C854 104 XFR Y,Z 2
823F D5FF 105 MXS -1(Z) NO SKIP NOW
8240 D001 8242 2( ) 106 BUI TRYCMA OK
8241 D00A 824C 11( ) 107 BUI ERROR NO
8242 6034 8276 52( ) 108 TRYCMA LDA TWO$IT 2
8243 5831 8274 49( ) 109 CMA THREE$IT
8244 D007 824C 8( ) 110 BUI ERROR NO
8245 D006 824C 7( ) 111 BUI ERROR NO
8246 5846 828C 70( ) 112 CMA ONE$IT 1
8247 D004 824C 5( ) 113 BUI ERROR NO
8248 D001 824A 2( ) 114 BUI TRYCML OK
8249 D002 824C 3( ) 115 BUI ERROR NO
824A 3842 828C 66( ) 116 TRYCML CML ONE$IT 1
824B D002 824E 3( ) 117 BUI TESTMODE OK
824C 703E 828A 62( ) 118 ERROR LMQ BIT7$IT
824D D010 825E 17( ) 119 BUI ERX$IT **** A KMT B002
121 * NOW TEST WHICH MODE IS BEING PROCESSED
824E C8A1 122 TESTMODE XFR S,A NEED BIT 1 FOR MODE
824F 8840 828F 64( ) 123 BON SINGLE$ SINGLE PREC IF ON
124 MODE D CPU IS IN DOUBLE
8250 6433 8283 51( ) 125 LDA ANSWER$,I FROM SLIT
8251 3C30 8281 48( ) 126 CML DBLANS$,I TEST FOR CORRECT ANSWER
8252 D4F9 824C -6( ) 127 BUI ERROR NG
8253 D009 825D 10( ) 128 BUI EXIT
8254 129 SINGLE DS 0
130 MODE S CPU IS IN SINGLE
8254 A42E 8282 46( ) 131 STP TEMPHOL$,I
8255 6032 8287 50( ) 132 LDA VALIDP WHAT P SHOULD BE
8256 3C2C 8282 44( ) 133 PVALID CML TEMPHOL$,I
8257 D4F4 824C -11( ) 134 BUI ERROR NO
8258 642B 8283 43( ) 135 LDA ANSWER$,I FROM SLIT
8259 3C27 8280 39( ) 136 CML SINGLAN$,I TEST FOR CORRECT ANSWER
825A D4F1 824C -14( ) 137 BUI ERROR NG
825B D82D 8288 45( ) 138 RST BIT1ZERO SET CPU TO DOUBLE FOR SECOND PASS
825C D4A4 8201 -91( ) 139 BUI DOSLIT TRY AGAIN
825D C802 141 EXIT XFR N,Q
825E D92F 828D 47( ) 142 ERX$IT SST BITS12 SET SINGLE PRECISION **** A KMT B015
143 MODE S **** A KMT B016
825F A81A 8279 26( ) 144 LXX XSETTING RESTORE COMPOOL REGISTERS **** A KMT B017
079E 145 USE X,EP$TAB **** A KMT B018
8260 B01A 827A 26( ) 146 LXY YSETTING **** A KMT B019
089E 147 USE Y,EP$TAB+256 **** A KMT B020
8261 C821 148 XFR Q,A ERROR FLAG **** A KMT B021
8262 9383 0821 131(X) 149 BOZ ITEST* RETURN-NO ERROR **** A KMT B022
8263 43F6 0894 246(X) 150 IOR CPUTEST **** A KMT B023
8264 6BF6 0894 246(X) 151 STA CPUTEST **** A KMT B024
8265 F383 0821 131(X) 152 BUA ITEST* RETURN-ERROR **** A KMT B025
153 *
8266 528F 154 DBLANS DC,2 $528F1D6B ADD-A-123076-001
8267 1D6B
8268 7FFF 155 ACONST DC,2 $7FFF7FFF ADD-A-123076-003
8269 7FFF
826A 5555 156 QCONST DC,2 $55555555
826B 5555
826C 1000 157 IORCONST DC,2 $10000000
826D 0000
826E 0ABC 158 SUBCONST DC,2 $0ABCDEF9
826F DEF9
8270 1DEF 159 ADDCONST DC,2 $1DEF2345
8271 2345
8272 FFF7 160 MINUSAND DC,2 $FFF7FFFF ADD-A-123076-005
8273 FFFF
8274 0003 161 THREE$IT DC,2 $00030003
8275 0003
8276 0002 162 TWO$IT DC,2 $00020002
8277 0002
8278 8459 163 SINGLANS DC,1 $8459 ADD-A-123076-007
8279 079E 164 XSETTING DC,1 EP$TAB
827A 089E 165 YSETTING DC,1 EP$TAB+256
827B 8233 166 TRYDMB$ DC,1 TRYDMB
827C 822B 167 TRYOVFL$ DC,1 TRYOVFLO
827D 8226 168 TRYNEG$ DC,1 TRYNEG
827E 8221 169 TRYPLUS$ DC,1 TRYPLUS
827F 824C 170 ERROR$ DC,1 ERROR
8280 8278 171 SINGLAN$ DC,1 SINGLANS
8281 8266 172 DBLANS$ DC,1 DBLANS
8282 0D00 173 TEMPHOL$ DC,1 TEMPHOLD
8283 0D02 174 ANSWER$ DC,1 ANSWER
8284 828C 175 ONE$ DC,1 ONE$IT LOCN OF VALUE
8285 8286 176 NEGI DC,1 NEG$ LOCN OF ADDRESS FOR INDIRECT TEST
8286 8272 177 NEG$ DC,1 MINUSAND LOCN OF VALUE
8287 8256 178 VALIDP DC,1 PVALID FOR STP TEST
8288 7FFF 179 BIT1ZERO DC,1 $7FFF RESET BIT 1
8289 FFF7 180 BIT13ZRO DC,1 $FFF7
181 *
182 *
828A 0100 183 BIT7$IT DC,1 $0100 ***** MUST BE ON EVEN WORD BOUNDARY ******
184 *
185 *
828B FFFF 186 MINUS1 DC,1 $FFFF **** A KMT B030
187 *
188 *
828C 0001 189 ONE$IT DC,1 1 ***** MUST BE ON EVEN WORD BOUNDARY. A NON-ZERO
190 * CONSTANT MUST FOLLOW THIS LOCATION *******
828D C000 191 BITS12 DC,1 $C000 SET BITS 1 AND 2
192 *
193 *
828E 8268 194 BIG$ DC,1 ACONST LOCN OF VALUE
828F 195 ITEST$EN DS 0 03U001JC040187
828F 8254 196 SINGLE$ DC,1 SINGLE
197 END
TROUBLESHOOTING, continued:
I figured out how to load this code into the computer on the Automated Test System. This was a little tricky since the system is set up just to run canned programs, but I figured it out!
I ran this test once and the computer jumped into never-never land. Oops, I forgot to handle where to go after the test is completed. In the real system, the program returns from interrupt and goes back into normal flight software. I had to add some hooks in the test to either stop or loop back to the beginning. Eventually I came up with a method to set up a count and output it to a set of lamps on the front panel, so the count advances each time the program goes through a successful calculation. If the test detects an error, the count is frozen and the test equipment halts the computer.
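The count-and-freeze hook can be sketched in a few lines of Python. This is only a model of the idea, not what ran on the test station: `run_until_failure` and `test_once` are hypothetical names, and the "lamps" are reduced to a returned count.

```python
def run_until_failure(test_once, max_loops):
    """Count successful passes; freeze the count at the first error.

    `test_once` is any callable returning True for a pass.
    Returns (pass_count, failed).
    """
    passes = 0
    for _ in range(max_loops):
        if not test_once():
            return passes, True  # count is frozen here, like the halt
        passes += 1
    return passes, False
```

The frozen count is useful by itself: it tells you how many good loops came before the bad one, which gives a feel for how rare the failure is.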
I found that the instruction test would run on average about 15 seconds before getting an error. But what’s causing it? The instruction test exercises a lot of instructions, as seen above. I put a hook in the test to bypass the branch tests (around address 8221). Even with no branch tests, the program was failing after 12 to 15 seconds. That told me that the failure was in the arithmetic portion of the program.
Next, this is a microcoded machine. When the computer is on the test station, you can access the microcode address. I put a logic analyzer on the 8-bit microcode address bus. I was able to find a copy of the microcode from 1977. When I ran the shortened multiply and divide test in assembly code, I could see on the logic analyzer the individual microcode addresses running for each instruction. With that setup, I could add logic analyzer probes to look at various control signals while each microcode address was running.
I changed the program a little bit to output the hex data 0100 to the display every time through the loop that passes and the data 0200 if there is an error. I kept the loop running even after detecting an error. The output data goes to lamps and a test point. I added the two bits from the test points, Data bit 6 (error bit) and Data bit 7 (pass bit), to the logic analyzer. Note that this computer is a fractional computer (better suited to calculating ephemeris for space operations), so the leftmost, most significant bit is called Bit 0. The least significant bit (in single precision) is the rightmost bit, bit 15.
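The bit numbering is easy to get backwards, so here is a small Python sketch of the convention just described (bit 0 on the left, bit 15 on the right, in a 16-bit single-precision word). The function names are made up for the example.

```python
WORD_BITS = 16  # single-precision word length used in the text


def msb0_mask(bit):
    """Mask for a bit numbered from the left (bit 0 = most significant)."""
    if not 0 <= bit < WORD_BITS:
        raise ValueError("bit number out of range")
    return 1 << (WORD_BITS - 1 - bit)


def msb0_bits_set(word):
    """List the left-numbered bit positions that are set in a word."""
    return [b for b in range(WORD_BITS) if word & msb0_mask(b)]
```

With this numbering, hex 0100 is bit 7 and hex 0200 is bit 6, which matches the pass and error test points.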
Now, with the microcode address and the Pass/Fail bits loaded on the logic analyzer, I could look through the microcode to see if anything was happening differently between when the program works and when the program fails. I saw that in correct multiply operations, microcode address 353 would run 8 times; in failed operations, however, we would only get 4 loops of 353. Microcode address 353 is the iterate command for the multiply. That leads me to the ALU board and the iteration counter. An excerpt of the logic is shown above. Sorry for the drawing; this was hand drawn back in the 1980s. This was my working copy.
A lot of troubleshooting is comparing working operations to non-working operations. Even though this computer failed very infrequently (1 out of 10,000 loops), you have to capture the failure condition and compare it to the working operation.
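Comparing a good capture to a bad one can be done by eye on the logic analyzer, but the same idea is easy to sketch in Python. The traces in the usage note are invented placeholders; only the 8 versus 4 counts for address 353 come from the observation above. `diff_address_counts` is a hypothetical helper name.

```python
from collections import Counter


def diff_address_counts(good_trace, bad_trace):
    """Return {address: (good_count, bad_count)} where the counts differ.

    Each trace is a list of microcode address labels in capture order.
    """
    good, bad = Counter(good_trace), Counter(bad_trace)
    addresses = sorted(set(good) | set(bad))
    return {a: (good[a], bad[a]) for a in addresses if good[a] != bad[a]}
```

For example, a passing capture with eight "353" entries and a failing capture with four would return `{"353": (8, 4)}`, pointing straight at the iterate step.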
I installed 4 more logic analyzer probes at the “Logic Analyzer 1 here” position, on the output of U101. I installed 2 more logic analyzer clips on the KA and KB control signals. This chip, U101, is a CD4019 And-Or Select chip. When KA (pin 9) is high, the chip selects the 4 “A” inputs to pass through from the inputs (pins 6, 4, 2, 15) to the outputs (pins 10, 11, 12, 13). When KB (pin 14) is high, the chip selects the 4 “B” inputs to pass through from the inputs (pins 7, 5, 3, 1) to the outputs (pins 10, 11, 12, 13). If both KA and KB are low, then the outputs are all zeroes. If both KA and KB are high, then the output is the logic “OR” of each “A” input with the “B” input.
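The And-Or Select behavior described above fits in one small Python function. This is a behavioral sketch only (no timing, no voltage levels), and the function name `cd4019` and the 4-bit integer packing of the inputs are choices made for the example.

```python
def cd4019(a, b, ka, kb):
    """4-bit AND-OR select: each output bit = (A and KA) or (B and KB).

    `a` and `b` are 4-bit values; `ka` and `kb` are the select lines.
    """
    a_term = (a & 0xF) if ka else 0
    b_term = (b & 0xF) if kb else 0
    return a_term | b_term
```

The four select combinations give the four cases in the text: A only, B only, all zeroes, or the OR of A and B.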
Since the designers of this computer are long retired, I had to go by the signal names to get a hint on what function this circuit is performing. Usually for multiply and divides, the algorithm uses a counter to perform the proper number of add and shift operations. So this logic with the control signals of ITCLD (Iteration Counter Load? …a fair guess) and ITCOUNT (Iteration Counter Count) select whether the outputs of U101 come from the Loaded count (when ITCLD is high) or from the SUM inputs (when ITCOUNT is high). The ITCLD and ITCOUNT signals are microcode ROM outputs. I saw that at micro address 351, the ITCLD signal goes high. When the instruction test passes, the 4 bit data is 1010. However, when the instruction fails, the 4 bit data on the output of U101 is 1000. That third bit from the left correlating to U101-pin 2 is low in the non working operation.
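Picking out the bad bit from the two captured patterns is a simple exclusive-OR. A minimal Python sketch, with a hypothetical helper name and positions counted from the left starting at 1, to match the "third bit from the left" wording:

```python
def differing_positions(good, bad, width=4):
    """Positions (1-based, from the left) where two bit patterns differ."""
    diff = good ^ bad
    return [i + 1 for i in range(width) if diff & (1 << (width - 1 - i))]
```

Calling `differing_positions(0b1010, 0b1000)` returns `[3]`, the third bit from the left.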
Is U101 bad?
Check by moving the logic analyzer probes to the input of U101 to the “Logic Analyzer 2 here” position to see what inputs it is getting in when the output is known to be incorrect. So my good tech David moved the probes to the output of U24 (input to U101 A inputs). Recall, since the error seems to be occurring when the chip U101 is in the ITCLD state, that is when the “A” inputs are active. So don’t worry at this time about the SUM inputs on the “B” side of U101.
With the probes on U24 outputs and the U24 KA and KB control signals, I see that during the error, the KA inputs were active into U24 and the output was also a data pattern 1000. This correlates to what was coming out of U101. So since U101 inputs were 1000 on the “A” side during the error, the output of 1000 says that U101 is working OK.
Is U24 bad?
Check by moving the logic analyzer probes to the input of U24 to the “Logic Analyzer 3 here” position to see what inputs it is getting in when the output is known to be incorrect. So my good tech David again moved the probes to the output of U23 (input to U24 A inputs). Again, since the error seems to be occurring when the chip U24 is in the KA state, that is when the “A” inputs are active. So don’t worry at this time about the KB inputs on the “B” side of U24.
With the probes on U23 outputs and the U23 KA and KB control signals, I see that during the error, the KA inputs were active into U23 and the output was also a data pattern 1000. This correlates to what was coming out of U101. So since U24 inputs were 1000 on the “A” side during the error, the output of 1000 says that U24 is also working OK.
Is U23 bad?
Check by moving the logic analyzer probes to the input of U23 to the “Logic Analyzer 4 here” position to see what inputs it is getting in when the output is known to be incorrect. So David moved the probes to the output of U21 (input to U23 A inputs). Again, since the error seems to be occurring when the chip U23 is in the KA (SP or single precision state), that is when the “A” inputs are active. So don’t worry at this time about the KB inputs on the “B” side of U23.
With the probes on U21 outputs and the U21 KA and KB control signals, I see that during the error, the KA inputs were active into U21 and the output was also a data pattern 1000. This correlates to what was coming out of U101. So since U23 inputs were 1000 on the “A” side during the error, the output of 1000 says that U23 is also working OK.
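The walk upstream from U101 follows one rule: a chip is only suspect if it was handed good data and put out bad data. Here is a minimal Python sketch of that rule. It assumes each stage simply passes its selected input through unchanged, and `locate_fault` is a hypothetical name; the stage data in the usage note are the 1000 patterns captured above, with 1010 as the expected value.

```python
def locate_fault(stages, expected):
    """Find the first stage with a good input but a bad output.

    `stages` is a list of (name, observed_input, observed_output),
    ordered from the most downstream chip going upstream.
    Returns the stage name, or None if every stage just passed along
    what it was given (so the fault lies further upstream).
    """
    for name, seen_in, seen_out in stages:
        if seen_in == expected and seen_out != expected:
            return name
    return None
```

Feeding in U101, U24, and U23, each with input 1000 and output 1000, returns `None`: all three are cleared and the search has to move further upstream.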
Is U21 bad?
Now this is a “simple” circuit. I saw that during the error, KA was selected on U21. This microcode signal MOI goes high during multiply instructions. The KB side control signal DOI goes high during divide instructions. Since the bit in error was coming out of U21 pin 12, we looked especially hard at U21 pin 12 and its inputs on U21 pin 2 (the A input) and pin 3 (the B input). We also monitored the KA and KB control signals, pins 9 and 14 respectively. Lo and behold, while we were probing this chip, the instruction test would not fail. We are on to something…
Of course we can’t leave the logic analyzer connected to the chip and fly it that way, there had to be a reason that the circuit works when the logic probes are attached. By then David’s shift was over and we called it quits for the night.
The next morning, another super tech, Bill, helped me out. I knew to go right to U21 and probe around. We removed the logic analyzer probes and could see that the test failed. As soon as Bill put a scope probe on pin 2, the test started passing consistently. I think we have found our problem.
Bill removed the board from the box and looked at pin 2 under a microscope. He could see that pin 2 was not soldered down to the pad. The signal was floating!
Looking at the circuit at U21 (these are all +10V CMOS chips), you can start to understand why the automated tests were not failing when only divides or only multiplies were run. When running only multiplies, for example, the MOI signal goes high and the input from pin 2 passes through U21 to the output. Since the pin was floating, the signal would most likely (in CMOS) float high. The input at U21 pin 2 is supposed to be tied to Vdd (+10V), so the signal there should always be high, and a pin that floats high looks just like a good connection. Recall, this is the constant that tells the multiply how many iterations to perform. When the divide instruction is run, the MOI signal goes low and the DOI signal goes high, selecting the “B” input, pin 3. This signal is hardwired low for the divide constant.
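The one bit of U21 that mattered can be modeled in Python to show why a floating pin hides so well. This is a logic-level sketch only: the function name is made up, and the floating pin is represented simply by passing in whatever level it happens to read at that instant.

```python
def u21_pin12(moi, doi, pin2_level):
    """Output bit on pin 12 = (pin 2 AND MOI) OR (pin 3 AND DOI).

    Pin 3 is hardwired low (the divide constant). Pin 2 should be
    tied high, so `pin2_level` is 1 for a good joint; a floating
    pin may read 1 most of the time and 0 once in a while.
    """
    pin3_level = 0
    return (pin2_level & moi) | (pin3_level & doi)
```

With MOI high, a pin 2 that reads high gives the same output as a properly soldered pin. Only the rare instant when the floating pin reads low produces a 0 where the multiply constant needs a 1.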
It was only by running this special Instruction test that the Multiply was performed followed by the Divide. This operation passed over ten thousand times before one time when the output of U21 pin 12 did not go high for a multiply. It must be that while KA and KB are changing rapidly, the float at U21 pin 2 could be considered to be a logic low for one time. This was enough to fail the test.
Repair: Resolder U21-2 and retest with the automated test as well as the special instruction test.
Case solved!
If you have any questions about troubleshooting, please send Fran an email: fran_oconnell@fxoinc.com
Or give me a call at 609-799-6450
The above troubleshoot was done in 40 hours. That was less than $6000.00 including travel costs! Cheap!