The earliest example of neural network data compression that I am aware
of is Schmidhuber, Jürgen, and Stefan Heil (1996), "Sequential neural
text compression", IEEE Trans. on Neural Networks 7(1): 142-146. They
trained a 3-layer network by backpropagation to predict the next
character given a 5-character context as input, and coded the
prediction using arithmetic coding. They improved a bit on UNIX
"compress" but it took days of CPU to train the model on just a few KB
of text. Also, back propagation of multi-layer networks requires
multiple training passes, so it was offline. They only compressed text
(in German), so their alphabet was size 80 rather than 256.
In 2000 I wrote some neural network compressors (P5, P6, P12) fast
enough to be practical, and a paper:
cs.fit.edu/~mmahoney/compression/nn_paper.html
The compression is better than gzip and the Schmidhuber-Heil system
but not as good as PPM. The major improvements were to eliminate
multiple training passes to make it online, to train only the last
layer, to treat the input as a stream of bits rather than bytes (to
speed up the range coder), and to keep a 0/1 count history with each
weight to control the learning rate. Later I dropped the weights and
just kept the 0/1 counts, and the result was PAQ.
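Roughly, each context keeps a pair of bit counts from which both the
prediction and the effective learning rate follow. The sketch below is
only an illustration (my own naming, not the actual PAQ code):

// Sketch of a per-context bit predictor built on 0/1 counts.
#include <cstdint>

struct BitCounter {
    uint32_t n0 = 0, n1 = 0;    // 0 and 1 bits seen so far in this context

    // Probability that the next bit is a 1 (add-1/2 smoothing).
    double p1() const { return (n1 + 0.5) / (n0 + n1 + 1.0); }

    // Record the actual bit.  The effective learning rate, 1/(n0+n1+1),
    // shrinks automatically as evidence accumulates in the context.
    void update(int bit) { if (bit) ++n1; else ++n0; }
};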
One of these days I will write PAQ7 and it will probably use neural
networks to combine predictions from different models. I have been
getting some good results combining predictions using:
p = squash(SUM_i(w_i * stretch(p_i)))
where p_i is the probability that the next bit will be a 1 according
to the i'th model, stretch(p) = ln(p/(1-p)), squash(x) = 1/(1 + e^-x)
(the inverse of stretch), and p is the final output to the range coder.
The weights w_i are updated using normal backpropagation:
w_i += r * p * (1-p) * (y - p)
where y is the actual bit and r (about .001 to .01) is the learning
rate.
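As a sketch of that mixer (illustrative code, not PAQ7; I also scale
each weight's update by its input stretch(p_i), the factor that full
backpropagation supplies):

// Sketch of the logistic mixer described above.
#include <cmath>
#include <cstddef>
#include <vector>

double stretch(double p) { return std::log(p / (1.0 - p)); }    // ln(p/(1-p))
double squash(double x)  { return 1.0 / (1.0 + std::exp(-x)); } // inverse of stretch

struct Mixer {
    std::vector<double> w;   // one weight per model
    double r;                // learning rate, about .001 to .01

    Mixer(std::size_t n, double rate = 0.005) : w(n, 0.0), r(rate) {}

    // p = squash(SUM_i(w_i * stretch(p_i)))
    double mix(const std::vector<double>& p) const {
        double x = 0.0;
        for (std::size_t i = 0; i < w.size(); ++i) x += w[i] * stretch(p[i]);
        return squash(x);
    }

    // After the actual bit y (0 or 1) is known:
    //   w_i += r * p * (1-p) * (y - p) * stretch(p_i)
    void update(const std::vector<double>& pin, double p, int y) {
        double g = r * p * (1.0 - p) * (y - p);
        for (std::size_t i = 0; i < w.size(); ++i) w[i] += g * stretch(pin[i]);
    }
};

For each bit, mix() gives p to the range coder, and update() adjusts
the weights once the actual bit is known.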
The neural network would probably replace the fixed weights now used to
average probabilities from the different mixer outputs in PAQAR. PAQAR
gets good compression because it combines a huge number of contexts and
models. I think these models could be computed in parallel to make the
speed reasonable.
-- Matt Mahoney