On the spectral radius of weight matrices in RNNs
When initializing the weight matrix (let's assume there is only one)
in an RNN (recurrent neural network) it is said (e.g. by Ilya Sutskever
in his PhD thesis) that you want the spectral radius (the size
of the largest eigenvalue in absolute value) to be slightly less
than $1$.
A common way to initialize the weight matrix is to draw its entries from a zero-mean Gaussian,
and then play with the variance until it works.
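As a minimal sketch of that kind of initialization (not necessarily the exact snippet this post had in mind), the following uses NumPy. The names `init_weights`, `spectral_radius` and `scale` are hypothetical, and `scale` is the standard deviation handed to the sampler, which is an assumption about what "variance" means in practice here.

```python
import numpy as np


def init_weights(n, scale, rng):
    """Return an n x n matrix of zero-mean Gaussian entries.

    `scale` is the standard deviation of each entry (assumed knob).
    """
    return rng.normal(loc=0.0, scale=scale, size=(n, n))


def spectral_radius(W):
    """Largest eigenvalue of W in absolute value."""
    return np.max(np.abs(np.linalg.eigvals(W)))


rng = np.random.default_rng(0)
W = init_weights(10, 0.1, rng)
print(spectral_radius(W))
```

The seed is fixed only so that the printed number is reproducible between runs.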
In this post we'll do a bit of exploratory mathematics
to probe how the variance and size of a matrix $W$
affect its spectral radius $\rho(W)$.
The distribution of the spectral radius
Since the spectral radius is somewhat difficult to
work with theoretically, we'll take an experimental approach
instead.
Let $W_{n,v}$ be the random $n \times n$ matrix consisting of
independent (zero mean) Gaussian entries with variance $v$.
The spectral radius $\rho(W_{n,v})$ is a function of these entries, and so is itself a random variable.
What does the distribution of the spectral radius
look like for, say, $W_{10,2}$? Sampling
a set of $10,000$ matrices from this distribution
gives the following result.
It looks somewhat Poisson-shaped!
(The spectral radius is continuous, so it can't literally be Poisson, but the resemblance is there.)
We could stop here, conjecture a distribution, and
then try to prove it, but let's move on.
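The sampling experiment can be sketched roughly as follows. This is an assumed reconstruction, not the code behind the original plot: `sample_spectral_radii` is a hypothetical helper, and the second index of $W_{10,2}$ is treated as the `scale` (standard deviation) given to the sampler. Use `np.sqrt(2.0)` instead if it should be a true variance.

```python
import numpy as np


def sample_spectral_radii(n, scale, num_samples, rng):
    """Spectral radii of `num_samples` independent n x n Gaussian matrices."""
    # One stacked array of shape (num_samples, n, n); eigvals works per matrix.
    W = rng.normal(0.0, scale, size=(num_samples, n, n))
    eigenvalues = np.linalg.eigvals(W)  # shape (num_samples, n), complex
    return np.abs(eigenvalues).max(axis=1)


rng = np.random.default_rng(0)
radii = sample_spectral_radii(n=10, scale=2.0, num_samples=10_000, rng=rng)

# Histogram counts approximate the shape of the distribution.
counts, bin_edges = np.histogram(radii, bins=30)
print("mean:", radii.mean(), "std:", radii.std())
```

Passing `radii` to a plotting library's histogram function would give a picture of the distribution; the summary statistics are printed here to keep the sketch dependency-free beyond NumPy.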
Fixing the matrix size
Typically when training RNNs the number of hidden units
is first decided upon, and then you go about mucking with the variance.
Below I've fixed the matrix size to $10$. I then
changed the variance between $0.1$ and $10$, and looked at
the expected spectral radius (since all
we really care about is that $E\{ \rho(W)\} \approx 1$).
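It looks linear! That's nice.
The coefficient here is about $3$.
(A caveat: scaling $W$ by $c$ scales $\rho(W)$ by exactly $|c|$, so this linearity is in the
standard deviation of the entries. If $v$ is read as a true variance, the radius grows like $\sqrt{v}$ instead.)
Thus, for the case of $n=10$, we know
that if our $v$ is, say, $0.1$,
then the expected spectral radius will be about $E\{\rho(W_{10,0.1})\} \approx 0.3$.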
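A rough way to check the linear relationship numerically is sketched below, assuming NumPy and treating the knob being varied as the `scale` (standard deviation) passed to the sampler. The helper name `expected_spectral_radius`, the grid of 20 scales and the 2,000 samples per point are all illustrative choices, not taken from the original experiment.

```python
import numpy as np


def expected_spectral_radius(n, scale, num_samples, rng):
    """Monte Carlo estimate of E{rho(W)} for n x n Gaussian matrices."""
    W = rng.normal(0.0, scale, size=(num_samples, n, n))
    return np.abs(np.linalg.eigvals(W)).max(axis=1).mean()


rng = np.random.default_rng(0)
n = 10
scales = np.linspace(0.1, 10.0, 20)
means = np.array([expected_spectral_radius(n, s, 2000, rng) for s in scales])

# Least-squares slope of a line through the origin: means ~ slope * scales.
slope = np.sum(scales * means) / np.sum(scales ** 2)
print("estimated coefficient:", slope)
```

Fitting through the origin is a deliberate choice: a matrix of zeros has spectral radius zero, so no intercept is needed.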
Varying the matrix size
What if we train the network, and then decide
that we'd really like more hidden units?
Can we be sure that the spectral radius
will stay the same (assuming we don't change the variance)?
Above I'm varying the size of the matrix,
while looking at the ratio of the
expected spectral radius to the variance of the
entries. It's not constant!
In other words, be aware that increasing
the size of a matrix, with the variance held fixed, also increases
its expected spectral radius.
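The size experiment can be sketched like this, again as an assumption-laden reconstruction: the sizes, the 200 samples per size and the helper `expected_spectral_radius` are hypothetical, and `scale` is the standard deviation of the entries. For large matrices with i.i.d. entries, the circular law suggests growth of roughly `scale * sqrt(n)`, so $\sqrt{n}$ is printed alongside for comparison.

```python
import numpy as np


def expected_spectral_radius(n, scale, num_samples, rng):
    """Monte Carlo estimate of E{rho(W)} for n x n Gaussian matrices."""
    W = rng.normal(0.0, scale, size=(num_samples, n, n))
    return np.abs(np.linalg.eigvals(W)).max(axis=1).mean()


rng = np.random.default_rng(0)
scale = 1.0
sizes = [5, 10, 20, 50, 100]

# Ratio of expected spectral radius to the scale, for each matrix size.
ratios = [expected_spectral_radius(n, scale, 200, rng) / scale for n in sizes]

for n, ratio in zip(sizes, ratios):
    print(f"n={n:4d}  ratio={ratio:.3f}  sqrt(n)={np.sqrt(n):.3f}")
```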
Conclusion
The conclusion of this post is basically
just that if your code initializes the weight matrix with Gaussian entries of a hand-tuned variance,
then you should be aware that if you change the size
of the matrix, then you'll also have to change
the variance.
Fixing it
How could we go about fixing this?
Preferably we'd like a theoretically motivated
relation between the variance, the size of the matrix,
and the spectral radius, so that we can
keep the radius size-invariant
(by automatically adjusting the variance).
Doing it theoretically is a theorem
for another day. But one
ad hoc method often suggested
is the following.
(There are theoretical arguments,
but I've never seen one from the perspective
of trying to keep the spectral radius fixed.)
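The original snippet is not reproduced here, so the following is only a guess at the kind of normalization meant: dividing the entries' standard deviation by $\sqrt{n}$, which is a commonly suggested scaling. The function `init_weights_normalized` and its `scale` parameter are hypothetical names. The loop estimates the ratio of expected spectral radius to `scale` for a few sizes.

```python
import numpy as np


def init_weights_normalized(n, scale, rng, num_samples=1):
    """Gaussian matrices whose entry std is scale / sqrt(n) (assumed trick)."""
    return rng.normal(0.0, scale / np.sqrt(n), size=(num_samples, n, n))


def spectral_radii(W):
    """Spectral radius of each matrix in a stack of shape (..., n, n)."""
    return np.abs(np.linalg.eigvals(W)).max(axis=-1)


rng = np.random.default_rng(0)
scale = 1.0
sizes = [5, 10, 20, 50, 100]

ratios = []
for n in sizes:
    W = init_weights_normalized(n, scale, rng, num_samples=200)
    ratios.append(spectral_radii(W).mean() / scale)
    print(f"n={n:4d}  ratio={ratios[-1]:.3f}")
```

Under this scaling, `scale` plays the role of the target spectral radius, so a value slightly below one would match the rule of thumb from the introduction.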
Amazingly, when we calculate the ratio
of the expected spectral radius to the variance,
as a function of the matrix size (using the normalization
trick above), it stays close to $1$ almost everywhere.
The conclusion is that when you
want the spectral radius to remain
fixed, just use the above instead.