On the spectral radius of weight matrices in RNNs

When initializing the weight matrix (let's assume there is only one) in an RNN (recurrent neural network) it is said (e.g. by Ilya Sutskever in his PhD thesis) that you want the spectral radius (the size of the largest eigenvalue in absolute value) to be slightly less than $1$.

A common way to initialize the weight matrix is,

and then play with the variance until it works.

In this post we'll do a bit of exploratory mathematics to prod the how the variance and size of a matrix, $W$, affects its spectral radius $\rho(W)$.

The distribution of the spectral radius

Since the spectral radius is somewhat difficult to work with theoretically, we'll take an experimental approach instead.

Let $W_{n,v}$ be the random matrix of size $n$ consisting of (zero mean) gaussian entries with variance $v$. The spectral radius is a random variable of these entries. How does the distribution of the spectral radius look for, say, $W_{10,2}$? Sampling a set of $10,000$ matrices from this distibution gives the following result.

It looks somewhat Poisson distributed! We could stop here, conjecture that it indeed is, and then try to prove it, but let's move on.

Fixing the matrix size

Typically when training RNNs the number of hidden units is first decided upon, and then you go about mucking with the variance. Below I've fixed the matrix size to $10$. I then changed the variance between $0.1$ and $10$, and looked at the expected spectral radius (since all we really care about is that $E\{ \rho(W)\} \approx 1$).

It looks linear! That's nice. The coefficient here is about $3$. Thus, for the case of $n=10$, we know that if our variance is, say, $v=0.1$, then the spectral radius will be about $\rho(W_{10,0.1}) = 0.3$.

Varying the matrix size

What if we train the network, and then decide that we'd really like more hidden units? Can we be sure that the spectral radius will stay the same (assuming we don't change the variance)?

Above I'm varying the size of the matrix, while looking at the proportion between the expected spectral radius and variance in the entries. It's not constant!

In other words, be aware that when increasing the size of a matrix then its spectral radius will also increase.

Conclusion

The conclusion of this post is basically just that if you have something like this in your code,

then you should be aware that if you change the size of the matrix, then you'll also have to change the variance.

Fixing it

How could we go about fixing this? Preferably we'd like a theoretically motivated expression between the variance, size of the matrix, and spectral radius, so that we can ensure the radius is size-invariant (by automatically changing the variance).

Doing it theoretically is a theorem for another day. But one ad hoc method often suggested is the following. (There are theoretical arguments, but I've never seen one from the perspective of trying to keep the spectral radius fixed.)

Amazingly, when we calculate the proportion between the expected spectral radius and the variance, as a function of the matrix size (using the normalization trick above), it's almost always at $\sim 1$.

The conclusion is that when you want the spectral radius to remain fixed; just use the above instead.