This quarter, I’m working on an algorithm described in a paper by Austin Fowler. This algorithm finds optimal sequences of quantum gates that approximate some arbitrary single-qubit quantum gate. Basically, given:

- A 2×2 unitary matrix U
- A list of “gates” G (2×2 unitary matrices)
- Some function D(a,b) that states how “close” a matrix a is to b

this algorithm finds sequences of gates from G that are “close” to U. Since a brute-force solution is too slow, Fowler uses an optimization: he basically skips redundant sequences. His method will take exponential time in the worst case, but generally provides shorter gate sequences than other algorithms.

My goal this quarter is to explore ways to further optimize this algorithm’s performance. There are several ways to do this:

**Parallelize the algorithm**to arbitrary scale: the more processing units, the better! I will need to decide how much data the processing units can share, though. The more data that processing units can share, the less likely it is that they’ll visit redundant sequences. However, processing units that spend little time sharing data won’t have to worry about synchronizing each other.**Use multi-ported or banked memory**to allow multiple simultaneous memory operations. Multi-ported memory can be read from and written to by multiple processing units at the same time. Banked memory is split into separate “banks” which can be accessed independently. These approaches facilitate sharing, but may come at the cost of memory speed.**Produce large, highly optimized lookup tables**containing precomputed data. Technically, these tables are supposed to be a*result*of my research, so preparing them now is putting the cart before the horse! Additionally, whether or not a sequence is “unique” depends on how you define “close” — which depends on what a given researcher wants!

I’m attempting to implement this algorithm on a Field-Programmable Gate Array, a special chip that lets me download circuits onto it, connected to its pins. I can effectively building my own custom processor, without having to etch circuitry into silicon! By optimizing the processor’s design for this algorithm, I can — hopefully — speed it up in some significant way.

**What Happened This Week?**

I started carefully examining Fowler’s source code to figure out how best to implement it on an FPGA. I think I have a decent grasp on everything except the details of the sequence generation algorithm.

I also worked on adding support for *complex multiplication matrices*. It’s pretty straightforward: implement an algorithm that multiplies two complex matrices together. By next week, I hope to post my work so far on Github.

Additionally, I started working on a matrix distance calculator, for measuring how “close” two matrices are to each other.

Finally, I obtained a brand-new Altera DE-1 board from deep in the bowels of the computer lab here at UW. It was still in its brand new box from the turn of the millennium! It’s not the best hardware to work with, but it works.

**Potential Problem: Calculation Precision**

The Altera CycloneII FPGA I’m using has hardware multiplier units that support 18-bit operands. Since I know that all my numbers are between -1 and 1, I use all 18 bits for the fractional part of the numbers. I tack on an extra sign bit that is manipulated outside of the hardware multipliers, so I can get as much precision out of calculations as possible.

However, the least significant bit of my multiplication results is not always correct. If that bit is wrong, my answer is off by 3.8×10^-6. That error can accumulate over each subsequent calculation, resulting in inaccurate gate sequences.

To improve accuracy, I may need to use more bits to store my numbers. The Altera CycloneII FPGA’s multiplier blocks can operate on two pairs of 9-bit operands, or one pair of 18-bit operands. If I need larger numbers, I would split them into 9-bit or 18-bit parts, multiply each part of the first operand with each part of the second, and add the results (with some shifting). Thus, if each number is split into N parts, I need N^2 multipliers.

As you may guess, the tradeoff is the *time and resources vs. the accuracy*. If I want the accuracy of Fowler’s C++ implementation, which uses doubles, I would need at least 52-bit numbers. I would need *9 multiplier units* just to multiply two numbers, and I’d have to add all of those results, taking even more precious clock cycles to complete.

Thus, in order to guarantee the accuracy of my calculations, I need to quantify the relationship between the number of bits and the resulting error: how many bits do I need to maintain a tolerable level of error? I’m at a loss for how to do this, so tips would be appreciated.

**Coming Next Week**

- Implement the entire algorithm on the FPGA.
- Have a design meeting with hardware professor Carl Ebeling, to brainstorm strategies for optimization.
- Meet with quantum theory professor Aram Harrow, to brainstorm various mathematical optimizations.
- Figure out how to optimize the number of bits.