A while ago I mentioned inference in a GitHub post on my other blog, in reference to security. But when we talk about inference in a wider informational sense, we often talk about complexity, and more formally, Kolmogorov complexity: the complexity of an object, such as a string, is the length of the shortest computer program that produces that particular string as output. Information, and language, in a statistical sense, can be identified or predicted using various complexity approximations. This is effectively a way of encoding a text, or individual characters, so they can be uniquely represented.
Yesterday, a paper published with the Association for Computational Linguistics was making the rounds on Twitter, demonstrating the potential for fast text classification using gzip and PyTorch.
The experiment uses gzip to calculate compression distance. But we can demonstrate a similar strategy in pure Bash, too. Among the chatter on Twitter, I saw some more senior software fellows discussing the use of this for various kinds of classification use cases, including malware analysis. And this post from 2011 was floating around, so I thought I'd use it as a boilerplate.
First we'll grab a Spanish (es-dict.txt) and an English (en-dict.txt) dictionary that are roughly the same size. In Bash, we'll consume a sentence as input and use our compute_string_distance() function to concatenate our string with a dictionary, compress it with gzip, then use wc to count the bytes. Then we'll compress the wordlist on its own and compute the difference between the two sizes. Lastly, we'll echo both results and their dictionaries into sort to classify which wordlist our input more closely matches by virtue of distance. The smaller the distance, the closer the match.
#!/bin/bash

if [ $# -ne 1 ]; then
  echo "Usage: bash langdetect.sh <sentence>"
  exit 1
fi

# Concatenate the sentence with a wordlist, gzip the result, and measure how many
# bytes the sentence adds on top of the compressed wordlist alone.
compute_string_distance() {
  sentence_length=$(echo -n "$1" | cat - "$2" | gzip | wc -c)
  wordlist_length=$(gzip -c "$2" | wc -c)
  echo "$((sentence_length - wordlist_length))"
}

sentence="$1"

english_result=$(compute_string_distance "$sentence" "en-dict.txt")
spanish_result=$(compute_string_distance "$sentence" "es-dict.txt")

# Sort both distances numerically and keep the dictionary with the smallest one.
result=$(echo -e "$english_result en-dict.txt\n$spanish_result es-dict.txt" \
  | sort -n | head -n 1 | cut -d' ' -f2)

if [ "$result" == "en-dict.txt" ]; then
  echo "English"
elif [ "$result" == "es-dict.txt" ]; then
  echo "Spanish"
else
  echo "Unknown"
fi
$ ./langdetect.sh "Self control, sobriety, and chastity indicate the power of the mind, not its passivity."
English
$ ./langdetect.sh "Pero hablo español en la playa. No me propongo predicar, vive y deja vivir"
Spanish
As we can see, each language has its own statistical complexity - its own compression distance - and we can use compressors to quickly classify strings in this way. Neural nets sometimes use complexity approximations in a similar manner to do inference. And of course compression also serves to reduce resource usage. For example, consider the string D wht s rght nd fr.
This text is clearly missing some letters. In fact, it doesn't include any vowels. But depending on you, reader, there's a fair probability you've already filled in the blanks, just by intuitively inferring them. The full string is "Do what is right and fair".
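If you want to reproduce that toy "compression" in the shell, deleting the vowels is a one-liner with tr:

$ echo "Do what is right and fair" | tr -d 'aeiouAEIOU'
D wht s rght nd fr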
We can say that the sentence without vowels is a compressed version of the full sentence. Compression schemes allow us to both preserve resources and predict information. The same idea holds whenever we use tar, or gzip, or 7z: we effectively delete some information, but information we can recover in a predictable way. This is how gzip works: it identifies repetitions with the LZ77 algorithm and then squeezes the result further with Huffman coding, strategically representing repetitive data in a more efficient way. On the other hand, you cannot compress "random data." And if you can, it's not random.
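A quick way to see this from the shell is to hand gzip a very repetitive stream and then a random one, and compare the output sizes (exact byte counts will vary with your gzip version):

# 100 KB of a repeated phrase compresses down to a tiny fraction of its size...
$ yes "do what is right and fair" | head -c 100000 | gzip | wc -c
# ...while 100 KB of random bytes comes out slightly larger than it went in
$ head -c 100000 /dev/urandom | gzip | wc -c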
LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the uncompressed data stream. A match is encoded by a pair of numbers called a length-distance pair, which is equivalent to the statement "each of the next length characters is equal to the characters exactly distance characters behind it in the uncompressed stream".[1]
Letter-oriented Huffman is good at detecting and compressing files using the letter frequencies alone, but it cannot detect correlations between consecutive letters (common words and syllables). LZ77 compresses an original file into an intermediate sequence of literal letters and "copy items".[2]
The compressor uses LZ77 compression to produce a stream of coded instructions, and uses one or more Huffman codebooks to compress those instructions.[3]
And decompression is the inverse operation. You can find some clinical remarks on gzip here. But the essence is this: under the hood of gzip, Huffman coding in conjunction with LZ77 minimizes the redundancy of the data being operated on while preserving its uniqueness.
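Unlike the vowel-stripping trick above, the gzip round trip is lossless; compressing and then decompressing hands back exactly what we started with:

$ echo "Do what is right and fair" | gzip | gunzip
Do what is right and fair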
This is useful not only in neural nets, but in classifying data at large, as it can be used to draw relations between sets of data in general. If I recall correctly, compression was also the culprit in the CRIME SSL vulnerability, CVE-2012-4929, enabling a chosen plaintext attack.
This blog post was inspired by reading "Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors", which uses gzip to measure distance between texts, outperforming several pretrained and non-pretrained classification strategies, and with a lot less hassle.
Though, in their paper, they implement a somewhat more rigorous scheme, using the k-nearest-neighbor algorithm. The authors use gzip to compute a normalized compression distance between strings (in effect, a "similarity score"), and feed those distances into a k-NN classifier. The researchers wrap all of this into an implementation using PyTorch, and have shared their experiment on GitHub.
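To get a feel for that distance, here is a minimal Bash sketch of a normalized compression distance between two strings. This is my own back-of-the-envelope version of the idea, not the authors' code; their implementation handles the k-NN voting and the datasets in Python.

# NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
# where C(s) is the gzip-compressed size of s in bytes.
ncd() {
  local cx cy cxy min max
  cx=$(printf '%s' "$1" | gzip -c | wc -c)
  cy=$(printf '%s' "$2" | gzip -c | wc -c)
  cxy=$(printf '%s%s' "$1" "$2" | gzip -c | wc -c)
  min=$(( cx < cy ? cx : cy ))
  max=$(( cx > cy ? cx : cy ))
  # Bash arithmetic is integer-only, so hand the division off to awk
  awk -v a="$cxy" -v b="$min" -v c="$max" 'BEGIN { printf "%.4f\n", (a - b) / c }'
}

ncd "the cat sat on the mat" "a cat sat on the mat"   # overlapping strings should score lower
ncd "the cat sat on the mat" "pero hablo español"     # unrelated strings should score higher

To actually classify a sentence, you'd compute this score against every labeled training example, sort, keep the k smallest, and let them vote on the label; that's the k-NN half of the paper.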
But it's cheerfully amusing that gzip, which lives on probably almost every device in the world, allegedly outperformed many other, fancier algorithms.
Update: It seems the gzip paper is possibly too good to be true; the experiment may have contained an error that yielded erroneous results, and thus gzip did not, in fact, outperform all of the other algorithms. Oh well.