The capstone is a project where a team works with industry partners (aka the sponsors) to build a product according to the sponsors’ specifications.
Ours was titled Mersenne Primality Search. For the unfamiliar, a Mersenne prime is a prime number exactly 1 less than 2 raised to some other prime number. These are very useful: every prime gives rise to a ring of integers modulo that prime with no proper nontrivial subrings (the only things that cleanly divide a prime integer are 1 and that integer itself, though this deserves further explanation and will be fleshed out when this post is further updated).
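To pin the definition down:

```latex
M_p = 2^p - 1 \ \text{with } p \text{ prime}: \qquad M_2 = 3,\quad M_3 = 7,\quad M_5 = 31,\quad M_7 = 127,\ \dots
```

Note that a prime exponent is necessary but not sufficient: M_11 = 2047 = 23 · 89 is composite, which is part of why these primes are hard to come by.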
A sufficiently large Mersenne prime can be used in something termed a Mersenne Twister, a pseudorandom number generator engine (the standard variant's period is the Mersenne prime 2^19937 − 1). However, these can be cracked: given enough observed outputs, the next elements in the sequence can be determined, since the Mersenne number and the generator's parameters are publicly available. Surely there are more applications. Classically, there are two approaches to obtaining and qualifying a Mersenne prime:
- proof based methods founded in number theory
- (sensible) brute force (sketched just below).
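To make the second bullet concrete, here is a minimal sketch of the Lucas–Lehmer test, the standard deterministic check that brute-force Mersenne searches run (I'm assuming this is the flavor of "sensible brute force" intended; it is what projects like GIMPS use). Python's arbitrary-precision integers do the heavy lifting here, and the repeated squaring of an enormous number modulo 2^p − 1 is exactly the operation that has to be both fast and exact:

```python
def lucas_lehmer(p: int) -> bool:
    """Return True iff M_p = 2**p - 1 is prime, for a prime exponent p.

    Lucas-Lehmer: s_0 = 4, s_{i+1} = s_i**2 - 2 (mod M_p);
    for odd prime p, M_p is prime exactly when s_{p-2} == 0 (mod M_p).
    """
    if p == 2:
        return True  # M_2 = 3; the recurrence below assumes odd p
    m = (1 << p) - 1          # the Mersenne number 2**p - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m   # one giant squaring per iteration
    return s == 0

# the small Mersenne prime exponents fall out immediately
print([p for p in (2, 3, 5, 7, 11, 13, 17, 19, 31) if lucas_lehmer(p)])
# -> [2, 3, 5, 7, 13, 17, 19, 31]  (11 is dropped: 2**11 - 1 = 2047 = 23 * 89)
```

Each iteration is dominated by one multiplication of p-bit numbers, which is why fast, exact large-integer multiplication is the whole game.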
Our project was to implement an algorithm for the latter on a target architecture a few may be familiar with: Google's Tensor Processing Unit. This hardware has received great acclaim for its efficient throughput and high speed, though it primarily resides in Google Cloud (edge TPUs exist for quantized neural network inference). There are a few issues intrinsic to the problem:
- these numbers are typically very large
- these numbers are very rare
Given these two constraints, one can immediately infer the following:
- precision is extremely important
- speed is nice where we can get it
- naive hardware probably will not support both
Because of these first principles, one should first consider the viability of the Tensor Processing Unit for this task. The speed is immediately appealing, and it is achieved through hardware dedicated to the matrix operations typical of a deep learning architecture. Specifically, these are matrix-multiply-accumulate machines (the MXU), with vector processing units physically beneath, where the resulting linear transformation is passed through some activation function (these are typically nonlinear, since nonlinearity is what imbues neural networks with their expressive power).
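As a rough sketch of that division of labour (plain NumPy standing in for the hardware; this is not TPU code, and the shapes are made up): the multiply-accumulate is the MXU's job, and the elementwise nonlinearity applied afterwards belongs to the vector/activation hardware.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 784), dtype=np.float32)  # one layer's weight matrix
x = rng.standard_normal(784, dtype=np.float32)         # input activation vector

# MXU's role: the matrix-multiply-accumulate (rows of W dotted against x)
z = W @ x

# VPU / activation hardware's role: an elementwise nonlinearity over the result
y = np.maximum(z, 0.0)  # ReLU
```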
What is a matrix product, mathematically? The matrix multiply we are all familiar with is a specific operation called the Cayley product. In the classic Ax = y form, the matrix A is considered to be composed of row vectors; each row vector takes the dot product with the column vector x. There will be actual closed-form renderings of this in the near future, but that is the main idea. The result is a vector y which is the linear transformation of x by A.

Neural networks operate with weight tensors transforming data tensors. What is a tensor, you may ask? A tensor is the generalization of a matrix, in its most general form (physicists imbue them with other properties that make them nice for describing relativistic physics). For example, a vector is a rank 1 tensor, a matrix is a rank 2 tensor, our computer screen data is a rank 3 tensor (screen position is a matrix, color values are a 3-dimensional vector in color space), and so on.

If we limit the neural networks we are currently considering to vector data, each layer maps to a weight matrix which linearly transforms the input data vector, and then takes the image of this linear transformation under some nonlinear function (which needn't be everywhere differentiable! e.g. ReLU).
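Until those closed forms land, here is a sketch of what they would look like: the Cayley product written out entrywise, and the single-layer map it induces.

```latex
% the Cayley product: each entry of y is a row of A dotted with x
y = Ax, \qquad y_i = \sum_{j=1}^{n} A_{ij}\, x_j, \qquad i = 1, \dots, m

% one layer on vector data: a linear transform by the weight matrix W,
% then an elementwise nonlinearity
h = \sigma(Wx), \qquad \text{e.g. } \sigma(t) = \mathrm{ReLU}(t) = \max(t, 0)
```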
Great. So how does this pertain to the task at hand? The MXU hardware doesn't actually contain any way to map the data nonlinearly; that operation is performed by the vector processing unit (VPU). However, this silicon's architecture is also limited to the specific use case of typical neural network operations. We first explore the specifics of the MXU, and then the VPU, in order to motivate our assessment of this architecture's viability for extreme-precision computing.
How does the matrix multiply unit work?
Inside the MXU, only multiply-accumulates are performed; activations are passed through nonlinear functions elsewhere, and the instruction set is limited to these neural network primitives. All iterations of the TPU have been precision limited. The first generation was, for example, designed for speech-to-text inference runtimes through quantized neural networks built in house by Google. Going forward, Google considered it an important offering to enable practitioners, both in house and out, to also train on the TPU. Hence, they implemented a floating-point format that enabled efficiency and high hardware utilization, and was intrinsically useful for neural network training. This was the Google Brain 16-bit floating-point format (bfloat16), which I explain in greater depth down below. However, the main takeaways are that the hardware supports, at most, half-precision (16-bit) floating point, and this flavor of half precision sacrifices mantissa bits for extra exponent bits, in effect enabling a greater expressive range, but only coarsely. This obviously does not make the TPU appear a viable candidate for precision computing, and this is explicitly mentioned in the white paper here.
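A small demonstration of what that trade-off costs, simulating float32 → bfloat16 rounding in NumPy (a sketch of the format's behaviour, not TPU code): bfloat16 keeps float32's 8 exponent bits but stores only 7 mantissa bits, so the dynamic range is essentially that of float32 while the precision is roughly 3 decimal digits.

```python
import numpy as np

def to_bfloat16(x):
    """Round float32 values to bfloat16 precision (result kept in float32
    so the lost mantissa bits are easy to inspect). Uses round-to-nearest-even
    on the 16 low bits that bfloat16 discards. Illustrative sketch only."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    rounding_bias = np.uint32(0x7FFF) + ((bits >> 16) & 1)
    return ((bits + rounding_bias) & np.uint32(0xFFFF0000)).view(np.float32)

vals = np.array([1.0 + 2.0**-10,   # needs 10 mantissa bits: rounded away
                 1e38],            # near float32's range limit: still representable
                dtype=np.float32)
print(to_bfloat16(vals))   # roughly [1.0e+00  9.97e+37]: the 2**-10 term is gone,
                           # but the huge magnitude survives
```

For exact arithmetic on enormous integers, that precision budget is the central obstacle.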