
Latin1 vs UTF8

Latin1 was the early default character set for encoding documents delivered via HTTP for MIME types beginning with text/. Today, only around 1.1% of websites on the internet use the encoding, along with some older applications. However, it is still the most popular single-byte character encoding scheme in use today. A funny thing about Latin1 encoding is that it maps every byte from 0 to 255 to a valid character. This means that literally any sequence of bytes can be interpreted as a valid string. The main drawback is that it only supports characters from Western European languages. UTF8 is different on both counts. Unlike Latin1, UTF8 supports a vastly broader range of characters from different languages and scripts. But as a consequence, not every byte sequence is valid. This is due to UTF8's added complexity: it uses multi-byte sequences for characters beyond the ASCII range. This is also why you can't just throw any sequence of bytes at it and ex...
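The difference is easy to see in a few lines of Python. This is only a minimal sketch: the helper name `try_decode` is made up for illustration, and it relies on Python's built-in "latin-1" and "utf-8" codecs.

```python
def try_decode(data: bytes, encoding: str):
    """Return the decoded string, or None if the bytes are invalid for `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Every possible byte value, 0 through 255, in one sequence.
all_bytes = bytes(range(256))

# Python's latin-1 codec maps each byte to the code point with the same number,
# so this decode can never fail.
latin1_text = all_bytes.decode("latin-1")

# The same lone byte means different things under each encoding.
print(try_decode(b"\xe9", "latin-1"))    # a single character
print(try_decode(b"\xe9", "utf-8"))      # None: 0xE9 starts a multi-byte sequence that never finishes
print(try_decode(b"\xc3\xa9", "utf-8"))  # the two-byte UTF-8 form of the same character
```

Returning None rather than raising is just a choice for this sketch; real code may prefer to let the UnicodeDecodeError propagate or pass an `errors=` handler to `decode`.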

The Etymology of 'Deadline'

The word 'deadline' has a popular etymology story rooted in 19th-century prison culture. The explanation, as many etymologists and US dictionaries cite it, often goes something like this:

"The word deadline likely stems from 19th-century prison culture, where the word denoted a line within or around a prison that prisoners were not allowed to cross. And those who did so risked being shot and killed."

The popular theory is typically followed by a secondary story involving newsprint stories, which goes something like this:

"... and shortly thereafter, journalists and newsprint companies began using the word 'deadline' to describe the dates by which publications must be completed. Works not completed by the deadline would not be printed and would 'die.'"

So, how can we measure the frequency of the words "deadline," "dead line," and "dead-line" for the years 1850 to 1940? How can we check this theory empirically?

Well, if we look at the Google Books Ngram Viewer, we can see that, during the 19th century, usage of the word 'deadline' jumps from zero to a measurable word frequency.

Ngrams for variants of 'deadline'

So, in the above graph, we can see that usage of the word "deadline" picks up around these dates. The data lines up with the 19th-century timeframe of war and prison culture, and then, much later, with the advent of newsprint companies, when there is a meteoric rise in the usage of the word.

The data seems to suggest that this was indeed some of the earliest usage of the word deadline, and that it is an early record of a new word being coined, likely influenced by 19th-century prison culture.

But on the other hand, "deadline" can also be analyzed as a compound of two simpler words: "dead" and "line." Both are words that existed long before the 19th century.

In this case, they likely have cognates in related languages: words that descend from the languages of our ancestors and have etymological connections to common parent languages.

To dig deeper into our inquiry about the etymology, let's break it into its constituent parts. How much usage do we see sampling ngrams for only the word "dead," from the years 1850 to 1940?

Ngrams for the word 'dead'

We see a lot more data here. The individual word 'dead' has a word frequency orders of magnitude greater, hovering between 0.0100% and 0.0120%.

Great. Now, what if we sample ngrams for just the word "line" over the same years, 1850 to 1940?

Ngrams for the word 'line'

The frequency of the word 'line' hovers between 0.0200% and 0.0300%, which is far higher than that of the compound 'deadline.'
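To make those percentages concrete, here is a rough sketch of how a relative frequency like the ones above could be computed on a small text. It assumes the figure means "occurrences of the n-gram divided by all n-grams of the same length," which is my reading of how the Ngram Viewer reports it. The function name, the tokenizer, and the sample text are all made up for illustration.

```python
import re


def relative_frequency(text: str, phrase: str) -> float:
    """Percentage of same-length n-grams in `text` that equal `phrase`."""
    # Lowercase words; a hyphenated form like "dead-line" stays one token.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    n = len(phrase.split())
    # Slide a window of n tokens across the text to build every n-gram.
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 100.0 * ngrams.count(phrase.lower()) / len(ngrams)


# Hypothetical sample text, not real corpus data.
SAMPLE = (
    "The guard drew a dead line around the camp. "
    "No prisoner crossed the dead-line. "
    "The editor set a deadline."
)

for variant in ("deadline", "dead line", "dead-line", "dead", "line"):
    print(f"{variant!r}: {relative_frequency(SAMPLE, variant):.4f}%")
```

On a real corpus the same calculation would be run per publication year, which is what gives the charts their time axis.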

This examination of the words, in an independent way, serves to illustrate their extensive and longstanding usage in our language, predating the 19th century.

This is merely data to support what we already intuitively know: that they are words "built in" to the English language, so to speak. Their frequent use suggests they are part of a larger structure of our language.

Of course, this is due to evolutionary processes, the mass transmission of language across generations, population dynamics, and the influence of selection and network effects.

Indeed, many of the words we have today come from a long evolutionary tree, often having their origins in Proto-Indo-European, the reconstructed ancestor of the Baltic, Slavic, and Germanic language families. The Germanic family includes English, German, Dutch, Swedish, and so on.

For example, in Proto-Germanic, the reconstructed word for "dead" is "*daudaz." And in modern German, it is "tot." Both of these words sound somewhat similar to the English word "dead."

But other Germanic languages offer even closer matches for the word "dead."

For example, the Danish and Swedish words for "dead" are "død" and "död," respectively. These are pronounced fairly close to the way we say "dead" in English today.

And the English word "line"? It is also quite similar to its counterparts in other Indo-European languages. In German, the word for "line" is "Linie." In Dutch, it is "lijn," and in Norwegian and Swedish, it is "linje." But the word goes back even further, to the Latin "linea" (a linen thread, and hence a line), which derives from "linum," the word for flax.

So, while the word "deadline" likely sprang from 19th-century prison culture, a further empirical analysis suggests that the individual words themselves are likely of Proto-Indo-European descent, and come from our very distant ancestors in another time.

This isn't to say that the etymology story surrounding the word "deadline" is false, but rather that its potential origins in 19th-century prison culture and newsprint companies are merely part of a larger story about language.
