Troubleshooting

In July of 2008, a client called us in. They had a very expensive custom-built computer sitting idle because of a problem, and they asked FXO to check it out. (Full disclosure: I worked on this computer as a direct employee of that company many years ago.)

Knowns: The computer passes the automated self test program. This is an extensive program that tests all of the instructions, the memory, and the computer’s I/O interfaces. However, when the client installed the computer in the system (a spacecraft simulator), they were getting indications of an instruction error.

Background: The Automated Test Program tests all of the instructions one at a time. For example, if performing a test of the multiply instruction, then mostly multiplies will be done and the results checked. If the divide instruction is being tested, then mostly divide instructions are executed (of course, there are the memory fetch & store, compare and branch instructions needed to check the results and take action).

When the computer is running in the spacecraft simulator, it is performing the normal flight software commands using all of the instructions. A small piece of the flight code runs every 2 seconds to check out the instruction set of the computer. That is, while the computer is running flight software, a short time is taken away to do a self check by executing a small set of instructions with known data and comparing the result at the end to the expected result. This is similar to a memory scrubbing routine that runs periodically to read and write back memory locations to keep the data and error correction check bits fresh.
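The self check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the flight code: the class and function names are made up, the multiply and divide operands are assumed values, and plain integer arithmetic stands in for the machine's fractional arithmetic.

```python
# Hypothetical sketch of a periodic instruction self check.
# Names and most operand values are assumptions made for illustration.

MASK16 = 0xFFFF  # assume a 16-bit single-precision word


def run_instruction_check(alu):
    """Run a short fixed sequence with known data and compare to known answers.

    `alu` is any object with add/mul/div methods. Returns True when all match.
    """
    checks = [
        (alu.add(0x0002, 0x3333), 0x3335),  # the ADD example used later in this article
        (alu.mul(0x0003, 0x0005), 0x000F),  # assumed operands
        (alu.div(0x000F, 0x0005), 0x0003),  # assumed operands
    ]
    return all((got & MASK16) == expected for got, expected in checks)


class GoodAlu:
    """Stand-in for a healthy ALU (integer math only)."""

    def add(self, a, b):
        return a + b

    def mul(self, a, b):
        return a * b

    def div(self, a, b):
        return a // b


class BadMultiplyAlu(GoodAlu):
    """Stand-in for an ALU whose multiply gives a wrong answer."""

    def mul(self, a, b):
        return (a * b) >> 1  # simulated multiply fault


print(run_instruction_check(GoodAlu()))         # True
print(run_instruction_check(BadMultiplyAlu()))  # False
```

In the real system a False result is what drives the computer “NOT OK”.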

Problem: While the computer was running in the simulator, it would very frequently go “NOT OK” and go off-line with an error in the instruction test portion of the code.

TROUBLESHOOT: Of course we took the computer back and ran the automated self test. The self test passed again. We had a tricky problem. The flight software engineer, Euclid, gave me a soft copy of the flight software that is run in the instruction test routine (below in its entirety).

Link to software here… TBD

TROUBLESHOOT continued: I figured out how to load this code into the computer on the Automated Test System. This was a little tricky since the system is set up just to run canned programs, but I figured it out!

I ran this test once and the computer jumped into never-never land. Oops, I forgot to handle where to go after the test is completed. In the real system, the program returns from interrupt and goes back into normal flight software. I had to add some hooks in the test to either stop or loop back to the beginning. Eventually I came up with a method to set up a count and output it to a set of lamps on the front panel each time the program goes thru a successful calculation. If the test detects an error, the count is frozen and the test equipment halts the computer.
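The loop-and-count hook can be pictured with the short Python sketch below. It is a stand-in for what was really done in assembly on the test station; `run_until_error`, `test_once`, and `max_loops` are hypothetical names chosen for the example.

```python
# Hypothetical sketch of the loop-and-count harness described above.

def run_until_error(test_once, max_loops):
    """Loop the test and return (pass_count, halted_on_error).

    `test_once` is any callable returning True when one pass of the test is good.
    """
    pass_count = 0
    for _ in range(max_loops):
        if not test_once():
            return pass_count, True  # count is frozen and the computer halts
        pass_count += 1              # this is the value shown on the lamps
    return pass_count, False


# A fake test that passes three times and then fails.
results = iter([True, True, True, False])
print(run_until_error(lambda: next(results), max_loops=100))  # (3, True)
```

The frozen count is what tells you how long the test ran before the failure.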

I found that the instruction test would run on average about 15 seconds before getting an error. But what’s causing it? The instruction test exercises a lot of instructions, as seen above. I put a hook in the test to bypass the branch tests (around address 8221). Even with no branch tests, the program was failing after 12 to 15 seconds. That told me that the failure was in the arithmetic portion of the program.

Next, Euclid calculated what all the operands should be going into and coming out of each instruction. For example, going into an ADD instruction there may have been 0002 in the accumulator and 3333 in the memory location. After the ADD instruction you would expect 3335 in the accumulator and 3333 still in memory. With this trace, I was able to fake out the program by loading the expected data into the memory or the accumulator and starting execution midway into the program. So instead of starting at address 8201 with the LXX command, I would preload the accumulator and memory locations manually and then start at address 8229. This way, I removed the code from 8201 to 8228 as the instigator of the anomaly.

Sure enough, when I ran from 8229 (the multiply instruction), the test was failing at 10 second intervals. I looped on the multiply instruction but the test didn’t fail. I looped on the divide instruction and the test didn’t fail. It was only when I combined the multiply and the divide that the test failed. So that led me to start probing the ALU board in the computer.

This is a microcoded machine. When the computer is on the test station, you can access the microcode address. I put a logic analyzer on the 8-bit microcode address bus. I was able to find a copy of the microcode from 1977. When I ran the shortened multiply and divide test in assembly code, I could see on the logic analyzer the individual microcode addresses running for each instruction. With that setup, I could add logic analyzer probes to look at various control signals while each microcode address was running.

I changed the program a little bit to output the hex data 0100 to the display every time thru the loop that passes and the data 0200 if there is an error. I kept the loop running even after detecting an error. The output data goes to lamps and a test point. I added the two bits from the test points, Data bit 6 (error bit) and Data bit 7 (pass bit), to the logic analyzer. Note that this computer is a fractional computer (better suited to calculating ephemeris for space operations), so the leftmost bit, the most significant bit, is called bit 0. The least significant bit (in single precision) is the rightmost bit, bit 15.
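If the bit numbering looks backwards, the small Python sketch below may help. It assumes a 16-bit word and counts bits from the left, as this computer does; the function name `set_bits_msb0` is made up for the example.

```python
# Sketch of MSB-first bit numbering: the leftmost bit of the word is bit 0.

WORD_BITS = 16  # single-precision word length


def set_bits_msb0(value, width=WORD_BITS):
    """Return the positions of the set bits, calling the leftmost bit 0."""
    return [i for i in range(width) if value & (1 << (width - 1 - i))]


print(set_bits_msb0(0x0100))  # [7]  the pass word lights Data bit 7
print(set_bits_msb0(0x0200))  # [6]  the error word lights Data bit 6
```

So the hex words 0100 and 0200 land on bit 7 and bit 6 in this numbering, which matches the two test points on the logic analyzer.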

Now, with the microcode address and the Pass/Fail bits loaded on the logic analyzer, I could look thru the microcode to see if anything was happening differently between when the program works and when the program fails. I saw that in correct multiply operations, microcode address 353 would run 8 times; however, in failed operations, we would only get 4 loops of 353. Microcode address 353 is the iterate command for the multiply. That led me to the ALU board and the iteration counter. An excerpt of the logic is shown above. Sorry for the drawing, this was hand drawn back in the 80’s. This was my working copy.

A lot of troubleshooting is comparing working operations to non-working operations. Even though this computer failed very infrequently (1 out of 10,000 loops), you have to capture the failure condition and compare it to the working operation.
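The same good-versus-bad comparison can be done on captured data. The Python sketch below is a rough illustration, assuming each capture is just a list of microcode addresses for one pass of the test. Only address 353 and the 8-versus-4 counts come from this case; the other addresses in the sample captures are placeholders.

```python
# Hypothetical sketch: compare how often each microcode address runs
# in a passing capture versus a failing capture.
from collections import Counter


def compare_captures(good_trace, bad_trace):
    """Return {address: (good_count, bad_count)} for addresses whose counts differ."""
    good, bad = Counter(good_trace), Counter(bad_trace)
    return {
        address: (good[address], bad[address])
        for address in sorted(set(good) | set(bad))
        if good[address] != bad[address]
    }


# Made-up captures. Addresses are kept as strings so no number base is implied.
good_capture = ["351"] + ["353"] * 8 + ["354"]
bad_capture = ["351"] + ["353"] * 4 + ["354"]

print(compare_captures(good_capture, bad_capture))  # {'353': (8, 4)}
```

Anything that shows up in the result is a place where the failing pass behaved differently and is worth a closer look.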

I installed 4 more logic analyzer probes at the “Logic Analyzer 1 here” position, on the output of U101. I installed 2 more logic analyzer clips on the KA and KB control signals. This chip, U101, is a CD4019 AND-OR Select chip. When KA (pin 9) is high, the chip selects the 4 “A” inputs to pass thru from the inputs (pins 6, 4, 2, 15) to the outputs (pins 10, 11, 12, 13). When KB (pin 14) is high, the chip selects the 4 “B” inputs to pass thru from the inputs (pins 7, 5, 3, 1) to the outputs (pins 10, 11, 12, 13). If both KA and KB are low, then the outputs are all zeroes. If both KA and KB are high, then each output is the logic “OR” of the “A” input with the corresponding “B” input.
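The select behavior just described boils down to one line of logic per bit. Here is a minimal Python model of it, assuming each input is a list of four 0/1 values and KA and KB are 0 or 1; the function name `cd4019` is just a label for the example, not anything from the test station.

```python
# Minimal behavioral model of a 4-bit AND-OR select chip.

def cd4019(a, b, ka, kb):
    """Each output bit is (A AND KA) OR (B AND KB)."""
    return [(a_bit & ka) | (b_bit & kb) for a_bit, b_bit in zip(a, b)]


a_inputs = [1, 0, 1, 0]
b_inputs = [0, 1, 1, 0]

print(cd4019(a_inputs, b_inputs, ka=1, kb=0))  # [1, 0, 1, 0]  "A" side selected
print(cd4019(a_inputs, b_inputs, ka=0, kb=1))  # [0, 1, 1, 0]  "B" side selected
print(cd4019(a_inputs, b_inputs, ka=0, kb=0))  # [0, 0, 0, 0]  nothing selected
print(cd4019(a_inputs, b_inputs, ka=1, kb=1))  # [1, 1, 1, 0]  OR of both sides
```

The input patterns here are arbitrary and are only meant to show the four cases.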

Since the designers of this computer are long retired, I had to go by the signal names to get a hint of what function this circuit is performing. Usually for multiplies and divides, the algorithm uses a counter to perform the proper number of add and shift operations. So in this logic, the control signals ITCLD (Iteration Counter Load? …a fair guess) and ITCOUNT (Iteration Counter Count) select whether the outputs of U101 come from the loaded count (when ITCLD is high) or from the SUM inputs (when ITCOUNT is high). The ITCLD and ITCOUNT signals are microcode ROM outputs. I saw that at micro address 351, the ITCLD signal goes high. When the instruction test passes, the 4-bit data on the output of U101 is 1010. However, when the instruction test fails, the 4-bit data is 1000. That third bit from the left, which corresponds to the U101 pin 2 input, is low in the non-working operation.

Is U101 bad?

Check by moving the logic analyzer probes to the input of U101 to the “Logic Analyzer 2 here” position to see what inputs it is getting in when the output is known to be incorrect. So my good tech David moved the probes to the output of U24 (input to U101 A inputs). Recall, since the error seems to be occurring when the chip U101 is in the ITCLD state, that is when the “A” inputs are active. So don’t worry at this time about the SUM inputs on the “B” side of U101.

With the probes on U24 outputs and the U24 KA and KB control signals, I see that during the error, the KA inputs were active into U24 and the output was also a data pattern 1000. This correlates to what was coming out of U101. So since U101 inputs were 1000 on the “A” side during the error, the output of 1000 says that U101 is working OK.

Is U24 bad?

Check by moving the logic analyzer probes to the input of U24 to the “Logic Analyzer 3 here” position to see what inputs it is getting in when the output is known to be incorrect. So my good tech David again moved the probes to the output of U23 (input to U24 A inputs). Again, since the error seems to be occurring when the chip U24 is in the KA state, that is when the “A” inputs are active. So don’t worry at this time about the KB inputs on the “B” side of U24.

With the probes on U23 outputs and the U23 KA and KB control signals, I see that during the error, the KA inputs were active into U23 and the output was also a data pattern 1000. This correlates to what was coming out of U24. So since the U24 inputs were 1000 on the “A” side during the error, the output of 1000 says that U24 is also working OK.

Is U23 bad?

Check by moving the logic analyzer probes to the input of U23 to the “Logic Analyzer 4 here” position to see what inputs it is getting in when the output is known to be incorrect. So David moved the probes to the output of U21 (input to U23 A inputs). Again, since the error seems to be occurring when the chip U23 is in the KA (SP or single precision state), that is when the “A” inputs are active. So don’t worry at this time about the KB inputs on the “B” side of U23.

With the probes on U21 outputs and the U21 KA and KB control signals, I see that during the error, the KA inputs were active into U21 and the output was also a data pattern 1000. This correlates to what was coming out of U23. So since the U23 inputs were 1000 on the “A” side during the error, the output of 1000 says that U23 is also working OK.
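The chip-by-chip walk upstream follows a simple rule: a chip that is handed bad data and passes it along is not the culprit, so keep moving toward the source until you find the first place the data goes bad. A rough Python sketch of that rule is below. It assumes every chip in the chain should show the same 1010 pattern while its “A” side is selected, and the function name is made up for the example.

```python
# Hypothetical sketch of walking a fault back toward its source.

EXPECTED = "1010"  # pattern seen on the output of U101 when the test passes


def first_bad_stage(observed, expected=EXPECTED):
    """Return the first chip along the signal path whose output is wrong.

    `observed` is a list of (chip, output_pattern) pairs ordered from the
    source of the signal to its destination. Chips after the first bad one
    are only passing the bad data along.
    """
    for chip, pattern in observed:
        if pattern != expected:
            return chip
    return None


# Output patterns captured during the error, ordered source to destination.
observed = [("U21", "1000"), ("U23", "1000"), ("U24", "1000"), ("U101", "1000")]
print(first_bad_stage(observed))  # U21
```

The sketch only says where the bad data first appears. Whether that chip is bad or is itself being fed something bad still has to be checked at its inputs.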

Is U21 bad?

Now this is a “simple” circuit. I saw that during the error, KA was selected on U21. The KA side control signal, MOI, is a microcode signal that goes high during multiply instructions. The KB side control signal, DOI, goes high during divide instructions. Since the bit in error was coming out of U21 pin 12, we looked especially hard at U21 pin 12 and its inputs on U21 pin 2 (the A input) and pin 3 (the B input). We also monitored the KA and KB control signals, pins 9 and 14 respectively. Lo and behold, while we were probing this chip, the instruction test would not fail. We are on to something…

Of course we can’t leave the logic analyzer connected to the chip and fly it that way. There had to be a reason that the circuit works when the logic probes are attached. By then David’s shift was over and we called it quits for the night.

Next morning, another super tech, Bill, helped me out. I knew to go right to U21 and probe around. We removed the logic analyzer probes and saw that the test failed. As soon as Bill put a scope probe on pin 2, the test started passing consistently. I think we are at our problem.

Bill removed the board from the box and looked at pin 2 under a microscope. He could see that pin 2 was not soldered down to the pad. The signal was floating!

Looking at the circuit at U21 (these are all +10V CMOS chips), you can start to understand why the automated tests were not failing when just divides or just multiplies were run. When running just multiplies, for example, the MOI signal goes high and the input from pin 2 passes thru the chip U21 to the output. The input at U21 pin 2 is supposed to be tied to Vdd (+10V), so the signal at U21 pin 2 should always be high. Recall, this is the constant that tells the multiply how many iterations to perform. Since the pin was floating, the signal would most likely (in CMOS) float high anyway, so the multiply still got the right constant. When the divide instruction is run, the MOI signal goes low and the DOI signal goes high, selecting the “B” input, pin 3. This signal is hardwired low for the divide constant.

It was only by running this special instruction test that the multiply was performed followed by the divide. This operation passed over ten thousand times before the one time when the output of U21 pin 12 did not go high for a multiply. It must be that while KA and KB are changing rapidly, the floating input at U21 pin 2 could be read as a logic low once in a while. That was enough to fail the test.
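A failure rate of roughly 1 in 10,000 loops also says something about how long a looping test has to run before you can trust a pass. The Python sketch below does that arithmetic. It assumes every loop fails independently with the same probability, which is a simplification and not something we measured; the function names are made up for the example.

```python
# Rough arithmetic on catching an intermittent fault by looping a test.
# Assumes each loop fails independently with the same probability.
import math

P_FAIL = 1 / 10_000  # rough per-loop failure rate seen in this case


def chance_of_catching(loops, p_fail=P_FAIL):
    """Probability of seeing at least one failure in `loops` passes."""
    return 1 - (1 - p_fail) ** loops


def loops_needed(confidence, p_fail=P_FAIL):
    """Smallest loop count that shows a failure with the given probability."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_fail))


print(round(chance_of_catching(10_000), 2))  # 0.63
print(loops_needed(0.99))                    # several times 10,000
```

Under that assumption, looping 10,000 times catches the fault only about 63% of the time, so a clean run after a repair should be several times longer than the average time to failure before you believe it.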

Repair: Resolder U21-2 and retest with the automated test as well as the special instruction test.

Case solved!

If you have any questions about troubleshooting, please send Fran an email: fran_oconnell@fxoinc.com

Or give me a call at 609-799-6450

The above troubleshoot was done in 40 hours. That was less than $6000.00 including travel costs! Cheap!

Request Quote or Rough Order of Estimate from FXO Inc. on your troubleshoot.