
Latin1 vs UTF8

Latin1 was the early default character set for encoding documents delivered via HTTP for MIME types beginning with text/. Today, only around 1.1% of websites on the internet use the encoding, along with some older applications. However, it is still the most popular single-byte character encoding scheme in use. A funny thing about Latin1 encoding is that it maps every byte from 0 to 255 to a valid character. This means that literally any sequence of bytes can be interpreted as a valid string. The main drawback is that it only supports characters from Western European languages. The same is not true for UTF8. Unlike Latin1, UTF8 supports a vastly broader range of characters from different languages and scripts. But as a consequence, not every byte sequence is valid. This is due to UTF8's added complexity: it uses multi-byte sequences for characters beyond the ASCII range. This is also why you can't just throw any sequence of bytes at it and expect to get a valid string back.
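
You can see this for yourself with iconv (assuming it's installed). The lone byte 0xE9 is a perfectly good Latin1 character, "é", but on its own it's an incomplete multi-byte sequence in UTF-8, so a strict UTF-8 decoder rejects it; the exact error text depends on your iconv build:

$ printf '\xe9' | iconv -f LATIN1 -t UTF-8
é
$ printf '\xe9' | iconv -f UTF-8 -t UTF-8   # fails with e.g. "illegal input sequence"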

Compression Schemes as Prediction Schemes

A while ago I mentioned inference in a GitHub post on my other blog, in reference to security. But when we talk about inference in a wider informational sense, we often talk about complexity, and more formally, Kolmogorov complexity. That is, the complexity of an object, such as a string, is the length of the shortest computer program that produces that particular string as output. Information, and language, in a statistical sense, can be identified or predicted using various complexity approximations. This is effectively a way of encoding a text, or individual characters, so they can be uniquely represented.
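
Kolmogorov complexity isn't computable in general, but the size of a string after compression makes a crude, practical stand-in. A minimal sketch: 100 KB of a trivially describable stream compresses to a tiny fraction of its size, while 100 KB of random bytes barely shrinks at all (and may even grow slightly):

$ head -c 100000 /dev/zero | gzip | wc -c       # tiny: the input is trivially describable
$ head -c 100000 /dev/urandom | gzip | wc -c    # roughly 100000: no structure to exploit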

Yesterday, a paper published with the Association for Computational Linguistics was making the rounds on Twitter, demonstrating a fast, parameter-free approach to text classification using gzip and PyTorch.

The experiment uses gzip to calculate compression distance. But we can demonstrate a similar strategy in pure Bash, too. Among the chatter on Twitter, I saw some more senior software folks discussing the use of this technique for various kinds of classification use cases, including malware analysis. And this post from 2011 was floating around, so I thought I'd use it as a boilerplate.

First we'll grab a Spanish (es-dict.txt) and an English (en-dict.txt) dictionary that are roughly the same size. In Bash, we'll consume a sentence as input and use our compute_string_distance() function to concatenate the string with a dictionary, compress it with gzip, then use wc to count the compressed bytes. We'll repeat the compression on the wordlist alone, and compute the difference between the two sizes. Lastly, we'll echo both results and dictionary names into sort to classify which wordlist our input more closely matches by virtue of distance. The smaller the distance, the closer the match.

#!/bin/bash

if [ $# -ne 1 ]; then
    echo "Usage: bash langdetect.sh <sentence>"
    exit 1
fi

# Compressed size of (sentence + wordlist) minus the compressed size of
# the wordlist alone: roughly, how much new information the sentence
# adds on top of that language's dictionary.
compute_string_distance() {
    local sentence_length wordlist_length
    sentence_length=$(echo -n "$1" | cat - "$2" | gzip | wc -c)
    wordlist_length=$(gzip -c "$2" | wc -c)
    echo "$((sentence_length - wordlist_length))"
}

sentence="$1"

english_result=$(compute_string_distance "$sentence" "en-dict.txt")
spanish_result=$(compute_string_distance "$sentence" "es-dict.txt")

# Sort the two distances numerically and keep the name of the wordlist
# with the smallest one.
result=$(echo -e "$english_result en-dict.txt\n$spanish_result es-dict.txt" \
    | sort -n | head -n 1 | cut -d' ' -f2)

if [ "$result" == "en-dict.txt" ]; then
    echo "English"
elif [ "$result" == "es-dict.txt" ]; then
    echo "Spanish"
else
    echo "Unknown"
fi
$ ./langdetect.sh "Self control, sobriety, and chastity indicate the power of the mind, not its passivity. "                                                   
English
$ ./langdetect.sh "Pero hablo español en la playa. No me propongo predicar, vive y deja vivir"                               
Spanish

As we can see, each language has its own statistical complexity, its own compression distance, and we can use compressors to quickly classify strings in this way. Neural nets sometimes use complexity approximations in a similar manner to do inference. And of course compression also serves to reduce resource usage.

For example, consider the string D wht s rght nd fr. This text is clearly missing some letters; in fact, it doesn't include any vowels. But depending on you, reader, there's a good chance you have already filled in the blanks, just by intuitively inferring them. The full string is "Do what is right and fair". We can say that the sentence without vowels is a compressed version of the full one. Compression schemes allow us to both preserve resources and predict information. The same idea applies whenever we use tar with gzip, or 7z: we effectively drop some information, but information we can recover in a predictable way. This is how gzip works, identifying repetition in the data via Huffman coding and the LZ77 algorithm, and strategically representing repetitive data in a more efficient way. On the other hand, you cannot compress "random data." And if you can, it's not random.
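
As an aside, the vowel-stripped string above can be produced with tr; the "decompression" step is the reader's own language model:

$ echo "Do what is right and fair" | tr -d 'aeiou'
D wht s rght nd fr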

LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the uncompressed data stream. A match is encoded by a pair of numbers called a length-distance pair, which is equivalent to the statement "each of the next length characters is equal to the characters exactly distance characters behind it in the uncompressed stream".[1]
Letter-oriented Huffman is good at detecting and compressing files using the letter frequencies alone, but it cannot detect correlations between consecutive letters (common words and syllables). LZ77 compresses an original file into an intermediate sequence of literal letters and "copy items".[2]
The compressor uses LZ77 compression to produce a stream of coded instructions, and uses one or more Huffman codebooks to compress those instructions.[3]
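
A hand-worked toy example of a length-distance pair (this is the idea, not literal DEFLATE output): "to be or not to be" can be encoded as the literals "to be or not " followed by the pair (distance=13, length=5), meaning "copy the 5 characters that start 13 characters back". You can also see the effect empirically: repeating a line a thousand times grows the compressed output far more slowly than the input, because each repeat is just another cheap copy item:

$ printf 'to be or not to be\n' | gzip | wc -c
$ yes 'to be or not to be' | head -n 1000 | gzip | wc -c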

And decompression is the inverse operation. You can find some clinical remarks on gzip here. But the essence is this: under the hood of gzip, Huffman coding in conjunction with LZ77 minimizes the redundancy of the data being operated on, while keeping it uniquely recoverable.
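
And because nothing is truly lost, a round trip through gzip and gunzip reproduces the input byte for byte. A minimal check with cmp (file names here are just for illustration):

$ printf 'Do what is right and fair\n' > original.txt
$ gzip -c original.txt > original.txt.gz
$ gunzip -c original.txt.gz | cmp - original.txt && echo "identical"
identical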

This is useful not only in neural nets but in classifying data at large, as it can be used to draw relations between arbitrary sets of data. If I recall correctly, compression was also the culprit in the CRIME SSL/TLS vulnerability, CVE-2012-4929, enabling a chosen-plaintext attack.
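
The CRIME intuition can be sketched in a few lines of Bash, purely as an illustration (the secret and the guesses here are made up): if attacker-controlled text repeats a secret that is already in the compressed stream, DEFLATE turns the repeat into a cheap back-reference, so the correct guess should compress a little smaller than a wrong one.

secret="sessionid=4f3c2d9ab81e77a0"

for guess in "sessionid=4f3c2d9ab81e77a0" "sessionid=9c1b2f3e4d5a6789"; do
    size=$(printf '%s%s' "$secret" "$guess" | gzip | wc -c)
    echo "$guess -> $size bytes"
done

In TLS, the attacker can't read the plaintext, but observing the length of the compressed, encrypted records is enough to test guesses like this and recover a cookie byte by byte.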

This blog post was inspired by reading "Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors", which uses gzip to measure distance between texts, outperforming several pretrained and non-pretrained strategies for classification, and with a lot less hassle.

In their paper, though, they implement a somewhat more rigorous scheme, using the k-nearest-neighbors algorithm. The authors use k-NN in combination with gzip to derive "similarity scores" for strings, wrap all of this into an implementation using PyTorch, and have shared their experiment on GitHub.
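
Their distance measure is, roughly, a normalized compression distance: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s. Here's a minimal Bash sketch of that formula using gzip for C; the function name is mine, and the result is scaled by 1000 because Bash arithmetic is integer-only:

ncd() {
    local cx cy cxy min max
    cx=$(printf '%s' "$1" | gzip | wc -c)
    cy=$(printf '%s' "$2" | gzip | wc -c)
    cxy=$(printf '%s%s' "$1" "$2" | gzip | wc -c)
    min=$(( cx < cy ? cx : cy ))
    max=$(( cx > cy ? cx : cy ))
    # Smaller scores mean the two strings compress well together,
    # i.e. they are more alike.
    echo $(( (cxy - min) * 1000 / max ))
}

With a distance like this in hand, k-NN is just a vote among the training samples with the smallest distances to the input.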

But it's cheerfully amusing that gzip, which lives on probably almost every device in the world, allegedly outperformed many fancier algorithms.

Update: It seems the gzip paper is possibly too good to be true; the experiment may have contained an error that yielded erroneous results, and thus gzip did not in fact outperform all of the other algorithms. Oh well.
