Date: 2023-06-05

Summary

There are not binary files on one side and text files on the other. All files are binary; some of those binary files can be interpreted as text.

Programming language APIs for opening files, as well as the web culture propagating the notion of "raw text" or "ANSI text", have confused a lot of coders into thinking text is some kind of special case. It's not.

Just like audio, videos or archives, text files are simply an arrangement of bytes that you can understand if you follow the specifications describing how to decode them: the encoding and the character set.

All files are binary

Keeping up with the tradition of making a beginner-friendly article to start the week, today we are going to kill a myth: the idea that you have this divide between binary files and text files.

This misconception comes from the fact that a lot of file-handling APIs give you a "text mode" and a "binary mode", which is confusing as hell.

Python is no exception to the rule:

Use open('file') to read a file and get a string.

Use open('file', 'rb') to read a file and get bytes.

Looking at this, it's easy to think that since there are two modes to open files, then there are two categories of files, and that they are either binary or text.

But no, all files are binary. Some of those 1s and 0s do represent text, and for convenience programming languages include a shortcut to get those bytes and translate them into text easily.

Let's say I write some text to a file:

with open("wololo.txt", "w", encoding="ascii") as f:
    f.write('oyo oyo!')

Now let's read it, telling Python not to translate it to text:

>>> with open("wololo.txt", "rb") as f:
...     print(list(f.read()))
...
[111, 121, 111, 32, 111, 121, 111, 33]

I get numbers between 0 and 255, so bytes.

Deep down, all files are composed of bytes.
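To make that concrete, here is a minimal sketch showing that text mode is nothing more than binary mode plus a decoding step. The temporary directory is an assumption made for the illustration, so the sketch leaves nothing behind:

```python
import os
import tempfile

# Hypothetical location: a throwaway directory, removed when the block exits.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wololo.txt")

    with open(path, "w", encoding="ascii") as f:
        f.write("oyo oyo!")

    # Binary mode: the raw bytes, untouched.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode: the same bytes, decoded for us by open().
    with open(path, encoding="ascii") as f:
        text = f.read()

# Decoding the bytes by hand gives the same string as text mode.
print(raw.decode("ascii") == text)
```

Both reads hit the exact same bytes on disk; the only difference is who does the decoding.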

Dividing files between text and binary makes no more sense than dividing files between audio and binary, video and binary or zip and binary.

There are thousands of ways to talk about the same thing

If I want to talk about the number of fingers I have on my right hand, I can describe it using numerous representations:

decimal system: 5

English word: five

Roman numeral: V

Chinese character: 五

binary base: 101

C struct using a signed short little-endian from Python: bytes([0b101, 0b0])

They are all arbitrary conventions, yet they all mean the exact same thing.

However, V and 101 can be understood differently depending on the context.

It's also the case for file formats.
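A short sketch of the idea, using only the standard library: several of the representations listed above are turned back into the same number, then the symbols 101 are read with two different conventions. The variable names are made up for the example:

```python
import struct

# The same quantity, five, written with different conventions.
from_decimal = int("5")
from_binary = int("101", 2)
# Signed short, little-endian, as in the C struct example.
from_struct = struct.unpack("<h", bytes([0b101, 0b0]))[0]

print(from_decimal == from_binary == from_struct == 5)

# The same symbols read with a different convention give a different value.
as_binary = int("101", 2)
as_decimal = int("101", 10)
print(as_binary, as_decimal)
```

Nothing in "101" itself says which reading is the right one; only the context does.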

The lie of raw text

For decades, people on the internet have propagated the notion that some content is "raw text", or even worse, "ANSI text", to justify that some files don't need specialized tooling to be processed.

What really happens is that text, like any format, is just bytes in a certain order, matching a spec. If you follow the spec, you analyse the bytes, and translate those bytes into symbols.

It's the same for music, you read the bytes, you may follow the FLAC or MP3 spec, and get sound. For video, or archives, same. It's all about the convention humans agree on to say, "those arbitrary bytes mean this stuff in this context".

This agreement is often called a format or a standard.

For text, it's the same, we have something called a character set and an encoding specification, and if you follow them, you can read bytes and get sense out of them.

You cannot read bytes without knowing the agreement, because bytes don't mean anything by themselves. In fact, the same bytes, using different agreements, mean different things.

E.g., the numbers "195" followed by "182" ("11000011" and "10110110" in binary) can be understood differently. If you choose to interpret them as "utf8":

>>> bytes([0b11000011, 0b10110110]).decode("utf8")
'ö'

That's a lowercase letter o with a diaeresis.

And if you interpret them as "gb2312":

>>> bytes([0b11000011, 0b10110110]).decode("gb2312")
'枚'

We get a Chinese character used to talk about flat and thin objects, such as a sheet of paper.

Same values, but depending on the agreement we use, we understand something different.

So there is not even such a thing as "raw" text, because any text can only be understood if you know the charset and encoding used to produce it.
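The reverse direction makes the same point. In this small sketch, one single character produces different bytes depending on the encoding you pick, and its UTF-8 bytes turn into something else when read with the wrong agreement. The choice of the three encodings is arbitrary, just for illustration:

```python
letter = "ö"

# One character, three encodings, three different byte sequences.
encoded = {name: list(letter.encode(name)) for name in ("utf8", "latin-1", "utf-16-le")}
for name, values in encoded.items():
    print(name, values)

# The UTF-8 bytes, read back with the wrong agreement.
garbled = letter.encode("utf8").decode("latin-1")
print(garbled)
```

This kind of garbled output is what you get on screen when a program guesses the encoding wrong.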

Most likely, though, when people say "raw text", they are talking about text using the ASCII character set. This is a very basic but universally understood format that even the most primitive computer system can read and display without any problem. If you click on a file containing ASCII on Windows, it will open notepad.exe, decode the content and show you the text seamlessly, giving you the impression there is no work behind this operation.

Plus ASCII can represent all English words, which means Americans could get away with using it everywhere for a long time without having to deal with those pesky foreign glyphs.
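You can check how narrow that comfort zone is with a few lines. This is only a sketch, and fits_in_ascii is a hypothetical helper name invented for the example:

```python
def fits_in_ascii(text):
    """Return True if every character of text can be encoded as ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True

print(fits_in_ascii("five fingers"))  # plain English text
print(fits_in_ascii("café"))          # a single accented letter

# Every ASCII byte is below 128, which leaves half of the byte range unused.
highest = max("five fingers".encode("ascii"))
print(highest < 128)
```

As soon as one character falls outside those 128 values, ASCII is not an option anymore and you have to pick another agreement.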

At this point you probably understand that when you see some guy in a movie telling you he can read "binary", it means absolutely nothing. At best, it may mean they can decipher ASCII text from a hexadecimal or binary representation (or a matrix wallpaper).

Which is cool, don't get me wrong.

I can't do it.
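Python can, though. A minimal sketch, assuming the "binary" on screen is ASCII text written as groups of 8 binary digits separated by spaces (the input string is made up for the example):

```python
# Hypothetical movie-hacker input: ASCII text written out as binary digits.
bits = "01101111 01111001 01101111"

# Each group of 8 digits is one byte; the bytes are then decoded as ASCII.
data = bytes(int(group, 2) for group in bits.split())
message = data.decode("ascii")
print(message)
```

Note that even here, "reading binary" only works because we assumed an agreement: 8 digits per byte, and ASCII on top.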

Where Python helps experts and tricks beginners

If you create bytes and print them, the Python shell may show you letters in something that looks like a string:

>>> bytes([0b1100001, 0b1100010, 0b1100011, 0b1100100])
b'abcd'

This is a great source of confusion for beginners.

But note the b' prefix: it is a subtle clue that the object is not a string, but bytes:

>>> type(b'abcd')
<class 'bytes'>
>>> type('abcd')
<class 'str'>

Wait, we have bytes, a sequence of numbers, and bytes can mean something else than text, so why when I create bytes Python shows me something that looks like text?

Well, it's a convenience feature for professional devs that often work with bytes.

You see, if you have to deal with bytes, you will see protocols that mix ASCII text and other formats. Also, if you are used to it, bytes are easier to read using a hexadecimal notation.

So when Python shows you bytes, it shows everything that can be mapped to a printable ASCII character as such, and anything else using a hexadecimal representation:

>>> bytes([0, 1, 200, 245]) # hex
b'\x00\x01\xc8\xf5'
>>> bytes([44, 45, 46]) # ascii
b',-.'
>>> bytes([0, 1, 44, 200, 45, 245, 46]) # both
b'\x00\x01,\xc8-\xf5.'
>>> 0xf5 # F5 is 245 in hex
245

Yeah, I know, it doesn't look like it's easier, but it's an acquired taste.

Python may show you total gibberish, but an experienced low level dev will see the number behind.
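If you are not that experienced dev yet, you can ask Python for less ambiguous views of the same object. A small sketch using only built-in bytes methods:

```python
data = bytes([0, 1, 44, 200, 45, 245, 46])

print(data)        # Python's mixed ASCII / hex-escape display
print(list(data))  # the numbers themselves
print(data.hex())  # every byte as two hexadecimal digits

# The hexadecimal text can be turned back into the same bytes.
round_trip = bytes.fromhex(data.hex())
print(round_trip == data)
```

All three displays describe the same seven bytes; only the notation changes.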

It's also a nice shortcut. It's easier to type b"abcd" than bytes([0b1100001, 0b1100010, 0b1100011, 0b1100100]). I only used the latter to make a point at the start of the article.

But yes, if you don't pay attention, you may be getting the wrong impression.

The point is, here, we don't use a number to represent text, we use text to represent a number:

>>> b"\xf5"[0] == 245 == 0xF5 # different notations, same value
True

A few recommendations when dealing with text