Running the Matrix Multiplier On the Board

Some Bad News…

Last week, my laptop died, which cost me a small amount of work.  I also had to finish a major project over this last weekend, so my progress has been slower than planned.  However, I have managed to make the FPGA multiply matrices that I download to it from a computer, and I have the full unoptimized sequencer working in simulation.  The major tasks I completed this week were:

  • Converting the numeric precision from 18 bits to 36 bits.
  • Finishing the sequence generator and sequence multiplier modules.
  • Creating a coordinator module that contains and connects all the other modules and handles communication with the computer.

You can skip to the Results section if you’re not hardware inclined…

Fitting It Onto The Board

Next, I tried to fit the matrix multiplier module onto the Altera Cyclone II FPGA.  Even though this circuit wasn’t my final design, it contained several major components.  By fitting a few components onto the board at a time, I could address problems as they occurred.

I knew that my circuit used too many hardware multipliers, but I didn’t realize that it needed nearly 2.6 times the number of logic elements that were available on the chip!

The complex matrix multiplier module, it turns out, uses 47327 combinational functions.  That’s because it contains 8 complex number multiplier modules, each consuming about 6000 combinational functions.  This result reaffirms what I already knew: I would only be able to use a few complex number multiplier modules in the final design.  In fact, since each complex number multiplier module requires 16 hardware multipliers, and the board only has 26, I can fit just one onto the board.
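
As a quick sanity check on those numbers (this only uses the figures quoted above; it is back-of-envelope arithmetic, not a substitute for the fitter report):

```python
# Rough resource arithmetic from the numbers above.
COMPLEX_MULTIPLIERS = 8            # complex number multiplier modules in the design
COMBINATIONAL_PER_MODULE = 6000    # approx. combinational functions each
HW_MULTS_PER_MODULE = 16           # hardware multipliers each module needs
HW_MULTS_ON_CHIP = 26              # hardware multipliers on the board

combinational_estimate = COMPLEX_MULTIPLIERS * COMBINATIONAL_PER_MODULE
modules_that_fit = HW_MULTS_ON_CHIP // HW_MULTS_PER_MODULE

print(combinational_estimate)  # 48000, in line with the reported 47327
print(modules_that_fit)        # 1
```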

Basically: I can’t do multiplication in parallel because there aren’t enough resources.  Other modules, like the matrix distance module, had a similar problem.  The solution is to increase the latency of the calculations so that I only need a few multipliers at a time.
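
To make that trade concrete, here is a toy Python model of serializing one complex multiply (the real design is HDL; the function and cycle counts here are purely illustrative). A complex product needs four real products, so done in parallel it costs four multipliers per cycle, while serialized it reuses one multiplier over four cycles:

```python
def complex_mult_serialized(a, b, c, d):
    """(a + bj) * (c + dj), computed one real multiply per 'cycle'."""
    products = []
    for x, y in [(a, c), (b, d), (a, d), (b, c)]:  # one multiplier, 4 cycles
        products.append(x * y)
    ac, bd, ad, bc = products
    return (ac - bd, ad + bc)  # (real part, imaginary part)

print(complex_mult_serialized(1, 2, 3, 4))  # (-5, 10), i.e. (1+2j)*(3+4j)
```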

Transmission Problems

Once I fixed that problem, I downloaded the updated circuit onto the board.  When I transmitted the numbers, the board appeared to go through the proper states.  But when it entered the transmission state, I received no data back!  Since it was working just fine in simulation, I opened up the SignalTap Logic Analyzer.  Here’s what I found:

After the coordinator module entered the transmission state, the data_to_send_ready signal went high, meaning the data was ready to send. However, the UART sender module didn’t do anything.  The UART sender uses a clock divider to operate at the baud rate of the serial port, meaning it only checks for data_to_send_ready every 434 clock cycles.  Since the data_to_send_ready signal wasn’t high when the UART sender did this check, nothing was sent.

To fix this, I made the data_to_send_ready signal stay high until the sender started transmitting data.
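
The bug and the fix can be sketched with a toy Python model of the sender’s periodic sampling (the divisor 434 comes from the design above; the rest is illustrative, not the actual HDL):

```python
DIVIDER = 434  # the UART sender samples data_to_send_ready once per baud interval

def sender_saw_ready(ready_cycles, total_cycles=2000):
    """Return True if the sender's periodic sample ever catches ready high."""
    sample_points = range(0, total_cycles, DIVIDER)
    return any(t in ready_cycles for t in sample_points)

# Bug: ready pulses high for a single cycle, missing every sample point.
pulse = {100}
print(sender_saw_ready(pulse))    # False -- the byte is never sent

# Fix: ready is latched high from cycle 100 until the sender acknowledges it.
latched = set(range(100, 2000))
print(sender_saw_ready(latched))  # True -- the sample at cycle 434 catches it
```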


Results

To test the system, I multiplied the Hadamard gate by the T gate.  The matrix multiply operands are on the top, and the final matrix is the expected result:

After running the Python commander program I wrote, passing in the operands, I obtained the result I expected:
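
For reference, the expected product can be reproduced on the host with NumPy (this is just the math, independent of my commander program):

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)         # Hadamard gate
T = np.array([[1, 0], [0, np.exp(1j * np.pi / 4)]])  # T gate

HT = H @ T
# First column entries are 1/sqrt(2) ~ 0.7071; second column is +/-(0.5 + 0.5j).
print(HT)
```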

It looks like these results are accurate to about +/- 10^-10: any digits after the first ten decimal places are off.  Hopefully, that should be good enough for our purposes, but I’ll need to examine how error accumulates over sequences of 60 or more gates.
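
A quick host-side sketch of that experiment: quantize each gate’s entries to a 10^-10 grid as a stand-in for the hardware’s rounding (my assumption, not a measured error model), then compare a 60-gate product against the exact one:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unitary():
    """A random 2x2 unitary, as a stand-in for an arbitrary gate."""
    q, _ = np.linalg.qr(rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2)))
    return q

def quantize(m, step=1e-10):
    """Round each real and imaginary part to the nearest multiple of step."""
    return np.round(m.real / step) * step + 1j * np.round(m.imag / step) * step

gates = [random_unitary() for _ in range(60)]
exact = np.eye(2, dtype=complex)
approx = np.eye(2, dtype=complex)
for g in gates:
    exact = exact @ g
    approx = quantize(approx @ quantize(g))

print(np.max(np.abs(exact - approx)))  # largest entrywise error after 60 gates
```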

Next Steps

We need to decide on a presentation date and paper due date.  I would prefer to have this due date during finals week, so I can get data on several optimizations before presenting.  If I have to present next Friday (our original deadline), I would cover:

  • Runtime performance on the FPGA, without optimization
  • A “meet in the middle” optimization, implemented in software on the computer

Presenting during finals week (preferably at the end) would allow me to try one (or more) of the following optimizations:

  • Austin Fowler’s optimizations, running on the FPGA
  • “Meet in the middle” optimization on the FPGA
  • FPGA tree memory lookup optimization

I probably won’t bother speeding up the clock on the FPGA, because it would yield a constant-factor improvement at best and isn’t really a novel technique.

Please let me know when you’d like me to present and turn in my paper.  Be aware that I will publish a draft of my paper and presentation at least a week before the deadline, so you won’t suddenly have a paper to grade just days before grades are due!

This entry was posted in Quantum Compiler Research. Bookmark the permalink.
