You can download the LaTeX source and figures here.

I gave a presentation at Microsoft QuArC a few weeks ago; visit my previous post for details and slides. Group member Alex Bocharov encouraged me to focus on demonstrating the *value* of the MITM optimization. The whole group wanted me to back up more of my claims numerically (and they probably wanted more interesting claims, period!).

Specifically, Alex was interested in the efficiency of the one-dimensional index I used to match up left and right matrices: for a given left matrix L, how well does the index find right matrices that are actually close to L? As it turns out, only **0.0003%** of the matrices returned by the index are close. Since the number of matrices it returns will probably grow exponentially, a speedup here could noticeably improve the algorithm's performance!

**A Quick Review…**

One key part of the “meet in the middle” algorithm is the “meeting” part: given a left matrix, my program needs to find a corresponding nearby right matrix. It basically searches for all points in a point cloud that are near a desired point, using the “Fowler” distance measure.

To optimize this “meeting” process in my original paper, I used a one-dimensional index. All of the right matrices are indexed by their distance from the target gate. To query for all right matrices near some left matrix L, I found the left matrix’s distance D from the target gate. Since the Fowler distance measure satisfies the triangle inequality, I know that any matrix which is close to L also has a distance to the target gate that’s within some range R around D. I used a binary tree to query for all right matrices whose distances fall within R.
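Sketched in Python (the names are illustrative, not my actual code), the 1D index amounts to a sorted array plus binary search:

```python
import bisect

# Hypothetical 1D index: right matrices keyed by distance to the target gate.
dists = []     # sorted distances to the target gate
matrices = []  # matrices[i] corresponds to dists[i]

def insert(distance_to_target, matrix):
    """Insert a right matrix, keeping both lists sorted by distance."""
    i = bisect.bisect_left(dists, distance_to_target)
    dists.insert(i, distance_to_target)
    matrices.insert(i, matrix)

def candidates(d, r):
    """All right matrices whose distance to the target lies in [d - r, d + r].
    The triangle inequality guarantees no close matrix is missed; it does
    NOT guarantee the returned candidates are actually close to L."""
    lo = bisect.bisect_left(dists, d - r)
    hi = bisect.bisect_right(dists, d + r)
    return matrices[lo:hi]
```

The weakness described below is visible here: the query filters on one number only, so most of the candidates it returns are far from L in the remaining dimensions.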

However, one dimension is not enough: *only 0.0003% of the right matrices the query returned were close to L*. Thus, I need a better algorithm that supports range queries over more dimensions, such as a k-d tree. (In fact, in my original MITM implementation, I never really explored the performance improvement offered by the 1D index over a linear search. I hope to remedy this in the next iteration of the paper.)

**Introducing FLANN**

FLANN stands for Fast Library for Approximate Nearest Neighbors. If you give it a set of data points, it will build an index that lets you rapidly find:

- all points within some radius around a given point.
- the k nearest neighbors of a given point (where k is an integer greater than 0).

Compared to other similar libraries, FLANN is very flexible:

- It allows me to specify my own distance measure. Most k-d tree libraries use Euclidean distance measures, which don’t apply to my problem.
- It supports points with any dimensionality (3D, 4D, etc.). Many libraries are limited to just 3D.
- FLANN supports a number of different index types, including k-d trees.
*In fact, if I use three-dimensional points, it even has a k-d tree implementation that runs on NVIDIA graphics cards!*

The manual mentions a special requirement for distance measures when using k-d trees: the distance must be *additive*. In other words: the final distance should be computable by adding distances between individual components. Fortunately, with the matrix parameterization that Aram and the QuArC group use, I can represent all matrices as vectors of just four (or even just three) real numbers. I can simplify the Fowler distance calculation to a dot product of two matrices’ vectors. Since the dot product meets our definition of additive, I can take advantage of FLANN’s k-d tree implementation with relatively little effort.
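As a sketch of why the distance becomes additive, assume the four-real-number parameterization stores each matrix as a unit 4-vector, and that the Fowler distance reduces to sqrt(1 − |⟨u, v⟩|). Both are my assumptions for illustration, not QuArC's exact convention:

```python
import math

def distance_from_dot(u, v):
    """Distance between two unitaries represented as unit 4-vectors.
    Assumed form: sqrt(1 - |<u, v>|), which is 0 when the matrices agree
    up to a global phase. The dot product is a plain sum over components,
    so the distance is 'additive' in the sense FLANN's k-d tree requires."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.sqrt(max(0.0, 1 - abs(dot)))
```

Because each component contributes an independent term to the dot product, a k-d tree can bound the distance one coordinate at a time while descending, which is exactly the property the manual asks for.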

**Overcoming the Batch Limitation**

FLANN’s API requires my program to provide all of the data for the index at once. However, I would like to add new unique matrices as I find them. This could require some mucking around within FLANN, which I’d prefer to avoid (at least initially).

Instead, I’ll use FLANN alongside my original index. All unique matrices will be added to the 1D index as they are discovered. Then, once the 1D index becomes “big enough”, the program will rebuild the FLANN index and empty the 1D index. I’ll need to figure out good values for “big enough” that balance FLANN index build time with the performance gain of higher quality search results. An optimal value will offer the lowest run time to enumerate all sequences of some fixed length.
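In outline, the hybrid scheme might look like this. All names and the threshold are placeholders, and the linear scans stand in for the real 1D and FLANN indexes:

```python
# Sketch of the two-tier index: new matrices go into a small incremental
# tier; once it grows past REBUILD_THRESHOLD, the batch-built index
# (standing in for FLANN) is rebuilt over everything and the incremental
# tier is emptied.
REBUILD_THRESHOLD = 4  # placeholder; the real value must be tuned

def distance(p, q):
    """Placeholder distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

class HybridIndex:
    def __init__(self):
        self.batch = []    # stands in for the FLANN index (rebuilt in bulk)
        self.pending = []  # stands in for the incremental 1D index

    def add(self, point):
        self.pending.append(point)
        if len(self.pending) >= REBUILD_THRESHOLD:
            # "Rebuild the FLANN index and empty the 1D index."
            self.batch = self.batch + self.pending
            self.pending = []

    def query(self, point, radius):
        # Search both tiers; a real implementation would use the k-d tree
        # for self.batch instead of this linear scan.
        return [p for p in self.batch + self.pending
                if distance(p, point) <= radius]
```

Tuning REBUILD_THRESHOLD is the balance described above: rebuild too often and the FLANN build time dominates; rebuild too rarely and most queries fall back to the weaker incremental tier.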

**The Rest of the Quarter**

Now that half of the quarter is gone, I’ve pared down my expectations quite a bit. This quarter, I plan to:

- Get a k-d tree optimization, such as FLANN, working optimally (and quantify its optimality).
- Finish the FPGA implementation of the algorithm, with Austin’s optimizations.
- Complete my research paper, answering some unanswered research questions such as the performance improvement of the 1D index.

I don’t think I’ll be able to extend my algorithm to run on a MapReduce system by the end of the quarter. However, if there’s time, it would be an interesting thing to try!


NOTE: this post is back-dated. I was too lazy to publish anything about the trip earlier!

NOTE: Ignore the trend lines; they should be fit to the last five or six points, because the different runs don’t show consistent behavior until then.

There are a ton of tiny fixes and optimizations I plan to try when I have time. I hope to *really* finish the research next quarter. Also, that’s when my advisors Paul, Aram, and Carl, who guided a lot of this work, will tell me all of the things I’ll need to fix before this paper is published. I’m looking forward to the k-d tree optimization in particular: intuition tells me that it’s the next largest bottleneck in this method. I know that I probably ought to have a *finished* product for my Honors project, but true research is never finished!

The board is giving me results! This one is very trivial though, because I just gave it the Hadamard gate to target. Now I just need to allow it to actually calculate sequences of interesting sizes.

**Coming Up Next Monday:**

- Complete paper draft
- Presentation draft
- The rest of the sequencer hardware implementation results!

Last week, my laptop died — which caused me to lose a little bit of my work. Also, I had to finish a major project over this last weekend. Thus, my progress has been somewhat stunted. However, I have managed to make the FPGA multiply matrices that I download to it from a computer, and I have the full unoptimized sequencer working in simulation. The major tasks I completed this week were:

- Converting the numeric precision from 18 bits to 36 bits.
- Finishing sequence generator and sequence multiplier modules.
- Creating a coordinator module to contain and connect all the modules together. It also handles communication with the computer.

*You can skip to the Results section if you’re not hardware inclined…*

**Fitting It Onto The Board**

I then tried to fit the matrix multiplier module onto the Altera CycloneII FPGA. Even though this circuit wasn’t my final design, it contained several major components. By fitting a few components onto the board at a time, I could address problems as they occurred.

I knew that my circuit used too many hardware multipliers, but I didn’t realize that it needed nearly 2.6 times the number of logic elements that were available on the chip!

The complex matrix multiplier module, apparently, uses 47327 combinational functions. That’s because it contains 8 complex number multiplier modules, each consuming about 6000 combinational functions. This result reaffirms what I already knew: I would only be able to use a few complex number multiplier modules in the final design. In fact, since each complex number module requires 16 hardware multipliers, and the board only has 26, I can fit just one onto the board.

Basically: I can’t do multiplication in parallel because there aren’t enough resources. Other modules, like the matrix distance module, had a similar problem. The solution is to increase the latency of the calculations so that I only need a few multipliers at a time.

**Transmission Problems**

Once I fixed that problem, I downloaded the updated circuit onto the board. When I transmitted the numbers, the board appeared to go through the proper states. But when it entered the transmission state, I received no data back! Since it was working just fine in simulation, I opened up the SignalTap Logic Analyzer. Here’s what I found:

After the coordinator module entered the transmission state, the data_to_send_ready signal went high, meaning the data was ready to send. However, the UART sender module didn’t do anything. The UART sender uses a clock divider to operate at the baud rate of the serial port, meaning it only checks for data_to_send_ready every 434 clock cycles. Since the data_to_send_ready signal wasn’t high when the UART sender did this check, nothing was sent.
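For context, a divider count of 434 is consistent with a 50 MHz system clock and a 115200-baud serial link; both figures are my assumptions here, since the post doesn't state them:

```python
# Assumed figures (not stated in the post): a 50 MHz system clock driving
# a 115200-baud serial port. The UART's clock divider counts system clock
# cycles per bit period.
CLOCK_HZ = 50_000_000  # assumed DE-1 system clock
BAUD = 115_200         # assumed serial baud rate

divider = round(CLOCK_HZ / BAUD)  # cycles between UART checks
```

With a window of one system clock cycle out of every 434, a ready pulse shorter than the divider period is easy to miss, which is exactly the failure described above.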

To fix this, I made the data_to_send_ready signal stay high until the sender started transmitting data.

**Results**

To test the system, I multiplied the Hadamard gate by the T gate. The matrix multiply operands are on the top, and the final matrix is the expected result:

After running the Python commander program I wrote, passing in the operands, I obtained the result I expected:

It looks like these results are accurate to about +/- 10^-10; any digits after the first ten decimal places are off. Hopefully, that should be good enough for our purposes, but I’ll need to examine the effects of error accumulation over gate sequences of length 60 or more.
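For reference, the exact product can be recomputed in floating point (a quick check, not the commander program itself):

```python
import math, cmath

# Hadamard and T gates as 2x2 complex matrices (lists of rows).
s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]
T = [[1, 0], [0, cmath.exp(1j * math.pi / 4)]]

def matmul(a, b):
    """Multiply two 2x2 complex matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# H*T = 1/sqrt(2) * [[1, e^{i pi/4}], [1, -e^{i pi/4}]]
HT = matmul(H, T)
```

Comparing the board's output digit-by-digit against a computation like this is how the +/- 10^-10 figure above was obtained.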

**Next Steps:**

We need to decide on a presentation date and paper due date. I would prefer to have this due date during finals week, so I can get data on several optimizations before presenting. If I have to present next Friday (our original deadline), I would cover:

- Runtime performance on the FPGA, without optimization
- A “meet in the middle” optimization, implemented in software on the computer.

Presenting during finals week (preferably at the end) would allow me to try one (or more) of the following optimizations:

- Austin Fowler’s optimizations, running on the FPGA
- “Meet in the middle” optimization on the FPGA
- FPGA tree memory lookup optimization

I probably won’t bother speeding up the clock on the FPGA, because it will earn a constant-time improvement at best and isn’t really a novel technique.

Please let me know when you’d like me to present and turn in my paper. Be aware that I will publish a draft of my paper and presentation within a week of the deadline, so you won’t suddenly have a paper to grade just days before grades are due!

- Completing and testing the gate table module from last week, which provides the matrix for a gate given that gate’s index number.
- Adjusting the fixed-point number format to use the most significant bit to store a one.

**Fixed Point Format Adjustments**

I’m using a fixed-point format to represent numbers in matrix calculations, since one property of these calculations is that the magnitude of the numbers won’t change significantly. My original design used all available bits for the fractional part of the number, but this means a one can’t be stored. Thus, I would essentially have to use 0.99999 instead of 1, introducing unnecessary error. More importantly, if a calculation results in a number slightly greater than 1, the 1 might be truncated, leaving only the fractional part.

So, I now use the most significant bit to store the one’s place digit. This leaves 17 bits for the fractional part. I can now represent numbers in the range (-2, 2). This adjustment involved a minor change to the fixed-point multiplication module.
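A Python model of the adjusted format (my sketch of the behavior, not the Verilog itself): values are stored as integers scaled by 2^17, with one integer bit and the sign handled separately:

```python
# 1.17 fixed-point format: 1 integer bit + 17 fractional bits.
# Values are stored as integers scaled by 2^17; the sign is assumed to be
# handled outside this representation, as in the hardware.
FRAC_BITS = 17
SCALE = 1 << FRAC_BITS

def to_fixed(x):
    """Encode a real number in [0, 2) as an 18-bit fixed-point integer."""
    return int(round(x * SCALE))

def from_fixed(f):
    """Decode a fixed-point integer back to a real number."""
    return f / SCALE

def fixed_mul(a, b):
    """Multiply two fixed-point values. The 36-bit product is shifted back
    down by FRAC_BITS, truncating the low bits as the hardware would."""
    return (a * b) >> FRAC_BITS
```

With the integer bit, `to_fixed(1.0)` is exactly representable (as 2^17), which is the whole point of the adjustment.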

I might automatically clamp all matrix cells to the range [-1, 1], since no matrix entry should exceed this range. This could eliminate some error, but it may not be worth the additional logic delay.

**Simulating the Sequence Multiplier**

Here’s a screenshot of the ModelSim output. **result_mtx** is a bus that contains the result of the matrix multiplication. The contents are grouped with braces in row-major order, and each complex number is displayed in the format {real-part, imaginary-part}.

I verified this result using Sage, a free computer algebra system available at sagenb.org. I multiplied the first three gates together, then multiplied the result by 2^17, because 17 of the 18 bits in my fixed-point numbers are fractional. Thus, the result reflects what I would see in simulation:

The results are only off by one in the least significant bit, which is expected given the lack of rounding. This level of precision won’t be typical, though: the first two matrices involved multiplication by one.

**Coming Up This Weekend/Next Week**

My main goal is to get performance numbers for the algorithm (1) on the computer, and (2) on the board. In later weeks, I’ll apply optimizations and analysis to each implementation, to see how much the performance can improve.

- **Implement the brute-force sequence generator.** If you imagine that each gate is a digit in a number, this generator is nothing more than an incrementor.
- **Implement a brute-force version of the algorithm.** This part involves little more than connecting up modules I’ve already developed. It will *not* check for duplicates, because I haven’t gained access to the onboard RAM yet.
- **Implement communication between the board and the computer** over the serial port. The computer will provide a gate to compile, and the board will report the result when found. It will also periodically indicate progress and performance statistics.
- **Increase numeric precision to 36 bits.** This process is fairly straightforward, but it may complicate the process of fitting things onto the board.
- **Download the design onto the board.** This won’t happen until Monday evening at the earliest. This step involves getting the design to fit, meaning I need to intelligently schedule the hardware multipliers.
- **Start drafting the paper.**
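The “gates as digits” idea behind the sequence generator can be sketched directly; advancing to the next sequence is just incrementing a base-|G| counter:

```python
def increment(seq, num_gates):
    """Advance a gate sequence (a list of gate indices, least significant
    digit first) to the next sequence, exactly like incrementing a
    base-num_gates number. Returns False once every sequence of this
    length has been exhausted."""
    for i in range(len(seq)):
        seq[i] += 1
        if seq[i] < num_gates:
            return True
        seq[i] = 0  # carry into the next digit
    return False

# Enumerate every length-2 sequence over a 3-gate set:
seq = [0, 0]
count = 1
while increment(seq, 3):
    count += 1
# count is now 9 == 3**2
```

In hardware, this is a ripple-carry incrementor over gate-index registers, which is why the module is so cheap compared to the multipliers.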

If you want to read my current paper draft, look at my source code on GitHub, etc., click **Quantum Compiler** at the top of this page.

Last week, I had a lot going on in other classes. I didn’t have much time to work on this project until Wednesday, so I only managed to do a little more design and implementation work. We had a design meeting on Friday to discuss the goals of the project, and I think we’ve nailed down a firm goal: to approximate the pi/6 gate to better than 10^-7 accuracy. Here’s what I’ve done so far:

- I now have a working distance module, which can be used to compute the distance between matrices.
- I also finished dissecting most of Austin Fowler’s algorithm, and am finishing up the implementation.
- I also posted a design document on Google Docs, my source code on Github, and a project summary on this site. You can get to all of them from the project summary.

I’m on track to finish the implementation this week on the FPGA! In future weeks, I’ll analyze the performance by profiling the algorithm on the computer and the FPGA. Then, I’ll make the appropriate optimizations. If all goes well, I’ll meet the goal and have something worthy of publication! If not, I should have enough information to provide someone with a good starting point to continue this work in the future.

The title says it all. This was the end of one of the worst weeks I’ve ever survived. I really need to stop procrastinating…


This quarter, I’m working on an algorithm described in a paper by Austin Fowler. This algorithm finds optimal sequences of quantum gates that approximate some arbitrary single-qubit quantum gate. Basically, given:

- A 2×2 unitary matrix U
- A list of “gates” G (2×2 unitary matrices)
- Some function D(a,b) that states how “close” a matrix a is to b

this algorithm finds sequences of gates from G that are “close” to U. Since a brute-force solution is too slow, Fowler uses an optimization: he basically skips redundant sequences. His method will take exponential time in the worst case, but generally provides shorter gate sequences than other algorithms.

My goal this quarter is to explore ways to further optimize this algorithm’s performance. There are several ways to do this:

- **Parallelize the algorithm** to arbitrary scale: the more processing units, the better! I will need to decide how much data the processing units can share, though. The more data that processing units can share, the less likely it is that they’ll visit redundant sequences. However, processing units that spend little time sharing data won’t have to worry about synchronizing with each other.
- **Use multi-ported or banked memory** to allow multiple simultaneous memory operations. Multi-ported memory can be read from and written to by multiple processing units at the same time. Banked memory is split into separate “banks” which can be accessed independently. These approaches facilitate sharing, but may come at the cost of memory speed.
- **Produce large, highly optimized lookup tables** containing precomputed data. Technically, these tables are supposed to be a *result* of my research, so preparing them now is putting the cart before the horse! Additionally, whether or not a sequence is “unique” depends on how you define “close”, which depends on what a given researcher wants!

I’m attempting to implement this algorithm on a Field-Programmable Gate Array, a special chip that lets me download circuits onto it, connected to its pins. I can effectively build my own custom processor, without having to etch circuitry into silicon! By optimizing the processor’s design for this algorithm, I can — hopefully — speed it up in some significant way.

**What Happened This Week?**

I started carefully examining Fowler’s source code to figure out how best to implement it on an FPGA. I think I have a decent grasp on everything except the details of the sequence generation algorithm.

I also worked on adding support for *complex matrix multiplication*. It’s pretty straightforward: implement an algorithm that multiplies two complex matrices together. By next week, I hope to post my work so far on Github.

Additionally, I started working on a matrix distance calculator, for measuring how “close” two matrices are to each other.

Finally, I obtained a brand-new Altera DE-1 board from deep in the bowels of the computer lab here at UW. It was still in its original box from the turn of the millennium! It’s not the best hardware to work with, but it works.

**Potential Problem: Calculation Precision**

The Altera CycloneII FPGA I’m using has hardware multiplier units that support 18-bit operands. Since I know that all my numbers are between -1 and 1, I use all 18 bits for the fractional part of the numbers. I tack on an extra sign bit that is manipulated outside of the hardware multipliers, so I can get as much precision out of calculations as possible.

However, the least significant bit of my multiplication results is not always correct. If that bit is wrong, my answer is off by 3.8×10^-6. That error can accumulate over each subsequent calculation, resulting in inaccurate gate sequences.

To improve accuracy, I may need to use more bits to store my numbers. The Altera CycloneII FPGA’s multiplier blocks can operate on two pairs of 9-bit operands, or one pair of 18-bit operands. If I need larger numbers, I would split them into 9-bit or 18-bit parts, multiply each part of the first operand with each part of the second, and add the results (with some shifting). Thus, if each number is split into N parts, I need N^2 multipliers.
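The limb-splitting scheme above can be sketched in Python: split each operand into 18-bit parts, form the N^2 partial products (one per hardware multiplier), and sum them with shifts:

```python
# Wide multiplication out of 18-bit hardware multipliers: each 18-bit
# "limb" of one operand is multiplied by each limb of the other (N^2
# partial products), and the shifted partial products are summed.
LIMB_BITS = 18
MASK = (1 << LIMB_BITS) - 1

def split(x, n_limbs):
    """Split a non-negative integer into n_limbs 18-bit limbs,
    least significant limb first."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(n_limbs)]

def wide_multiply(a, b, n_limbs):
    """Multiply via N^2 limb products, mirroring how N^2 hardware
    multipliers (plus an adder tree) would compute the result."""
    total = 0
    for i, ai in enumerate(split(a, n_limbs)):
        for j, bj in enumerate(split(b, n_limbs)):
            total += (ai * bj) << (LIMB_BITS * (i + j))
    return total
```

With 52-bit operands, three 18-bit limbs suffice, giving the 3^2 = 9 multipliers mentioned below.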

As you may guess, the tradeoff is *time and resources vs. accuracy*. If I want the accuracy of Fowler’s C++ implementation, which uses doubles, I would need at least 52-bit numbers. I would need *9 multiplier units* just to multiply two numbers, and I’d have to add all of those results, taking even more precious clock cycles to complete.

Thus, in order to guarantee the accuracy of my calculations, I need to quantify the relationship between the number of bits and the resulting error: how many bits do I need to maintain a tolerable level of error? I’m at a loss for how to do this, so tips would be appreciated.

**Coming Next Week**

- Implement the entire algorithm on the FPGA.
- Have a design meeting with hardware professor Carl Ebeling, to brainstorm strategies for optimization.
- Meet with quantum theory professor Aram Harrow, to brainstorm various mathematical optimizations.
- Figure out how to optimize the number of bits.