Summary
There are not binary files on one side and text files on the other. All files are binary; some of those binary files can be interpreted as text.
Programming language APIs for opening files, as well as the web culture propagating the notion of "raw text" or "ANSI text", have confused a lot of coders into thinking text is some kind of special case. It's not.
Just like audio, videos or archives, text files are simply an arrangement of bytes that you can understand if you follow the specifications describing how to decode them: the encoding and the character set.
All files are binary
Keeping up with the tradition of making a beginner-friendly article to start the week, today we are going to kill a myth: the idea that you have this divide between binary files and text files.
This misconception comes from the fact that a lot of file handling APIs give you a "text mode" and a "binary mode", which is confusing as hell.
Python is no exception to the rule:
Use open('file') to read a file and get a string.
Use open('file', 'rb') to read a file and get bytes.
Looking at this, it's easy to think that since there are two modes to open files, then there are two categories of files, and that they are either binary or text.
But no, all files are binary. Some of those 1s and 0s do represent text, and for convenience, programming languages include a shortcut to read those bytes and translate them into text easily.
Let's say I write some text to a file:
with open("wololo.txt", "w", encoding="ascii") as f:
    f.write('oyo oyo!')
Now let's read it by telling Python not to translate it to text:
>>> with open("wololo.txt", "rb") as f:
...     print(list(f.read()))
...
[111, 121, 111, 32, 111, 121, 111, 33]
I get numbers between 0 and 255, so bytes.
Deep down, all files are composed of bytes.
Dividing files between text and binary makes no more sense than dividing files between audio and binary, video and binary or zip and binary.
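To make that concrete, here is a minimal sketch showing that "text mode" is nothing more than "binary mode" plus a decoding step. The temporary directory and the variable names are my own choices for the illustration, not anything Python imposes.

```python
import os
import tempfile

# Write a short text file, choosing the encoding explicitly.
path = os.path.join(tempfile.mkdtemp(), "wololo.txt")
with open(path, "w", encoding="ascii") as f:
    f.write("oyo oyo!")

# Text mode: Python reads the bytes and decodes them for us.
with open(path, encoding="ascii") as f:
    as_text = f.read()

# Binary mode: we get the bytes untouched and decode them ourselves.
with open(path, "rb") as f:
    as_bytes = f.read()

print(as_text == as_bytes.decode("ascii"))  # True
```

Same file, same bytes on disk. The only difference is who does the decoding.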
There are thousands of ways to talk about the same thing
If I want to talk about the number of fingers I have on my right hand, I can describe it using numerous representations:
decimal system: 5
English word: five
roman numeral: V
mandarin symbol: 五
binary base: 101
C struct using a signed short little-endian from Python:
bytes([0b101, 0b0])
They are all arbitrary conventions, yet they all mean the exact same thing.
However, V and 101 can be understood differently depending on the context.
It's also the case for file formats.
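Here is a small sketch of that idea in Python, assuming we stick to the standard library's int() and struct module. Several notations give the same number five, and the same two bytes give a different number once you change the convention used to read them.

```python
import struct

five = 5

# Base 2 and base 10 text notations parse to the same integer.
assert int("101", 2) == five
assert int("5", 10) == five

# Two bytes, read as a little-endian signed short ("<h"), also give 5.
raw = bytes([0b101, 0b0])
assert struct.unpack("<h", raw)[0] == five
assert int.from_bytes(raw, byteorder="little", signed=True) == five

# The same two bytes read as big-endian mean something else entirely.
big_endian_value = struct.unpack(">h", raw)[0]
print(big_endian_value)  # 1280
```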
The lie of raw text
For decades, people on the internet have propagated the notion that some content is raw text, or even worse, "ANSI text", to justify that some files don't need any specialized tooling to be processed.
What really happens is that text, like any format, is just bytes in a certain order, matching a spec. If you follow the spec, you analyse the bytes, and translate those bytes into symbols.
It's the same for music, you read the bytes, you may follow the FLAC or MP3 spec, and get sound. For video, or archives, same. It's all about the convention humans agree on to say, "those arbitrary bytes mean this stuff in this context".
This agreement is often called a format or a standard.
For text, it's the same: we have something called a character set and an encoding specification, and if you follow them, you can read bytes and make sense of them.
You cannot read bytes without knowing the agreement, because bytes don't mean anything by themselves. In fact, the same bytes, using different agreements, mean different things.
E.g., the numbers "195" followed by "182" ("11000011" and "10110110" in binary) can be understood differently if you choose to interpret them in "utf8":
>>> bytes([0b11000011, 0b10110110]).decode("utf8")
'ö'
That's the letter O with a diaeresis (or umlaut), as found in German or Swedish.
And if you interpret it in "gb2312":
>>> bytes([0b11000011, 0b10110110]).decode("gb2312")
'枚'
We get a Chinese character used as a measure word for small objects such as coins or stamps (in Japanese, the same character counts flat, thin objects such as sheets of paper).
Same values, but depending on the agreement we use, we understand something different.
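You can push the experiment a bit further with a quick sketch. The list of codecs below is an arbitrary pick of mine among the ones shipped with CPython; any other selection would make the same point.

```python
raw = bytes([195, 182])

# Decode the exact same two bytes with several agreements.
decoded = {}
for codec in ("utf8", "gb2312", "latin-1"):
    decoded[codec] = raw.decode(codec)
    print(codec, "->", decoded[codec])

# Some agreements simply reject those bytes: ASCII stops at 127.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as error:
    ascii_failed = True
    print("ascii ->", error.reason)
```

With "latin-1", each byte becomes one character, so you get the two characters 'Ã¶', the classic sign of UTF-8 text read with the wrong encoding.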
So there is not even such a thing as "raw" text, because any text can only be understood if you know the charset and encoding used to produce it.
Most likely, though, when people say "raw text", they mean text using the ASCII character set. This is a very basic but universally understood format that even the most primitive computer system can read and display without any problem. If you click on a file containing ASCII on Windows, it will open notepad.exe, decode the content and show you the text seamlessly, giving you the impression there is no work behind this operation.
Plus, ASCII can represent nearly all English words, which means Americans could get away with using it everywhere for a long time without having to deal with those pesky foreign glyphs.
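A short sketch shows where ASCII stops being enough. The sample strings are mine, picked only for the illustration.

```python
english = "Hello, world!"
french = "Noël à la crème brûlée"

# English fits in ASCII without any trouble.
print(english.encode("ascii"))  # b'Hello, world!'

# Accented letters have no ASCII code, so encoding fails.
try:
    french.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as error:
    ascii_ok = False
    print("ASCII cannot encode:", french[error.start:error.end])

# UTF-8 can encode them, using more than one byte per accented letter.
utf8_bytes = french.encode("utf8")
print(len(french), "characters,", len(utf8_bytes), "bytes")
```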
At this point you probably understand that when some guy in a movie tells you he can read "binary", it means absolutely nothing. At best, it may mean he can decipher ASCII text from a hexadecimal or binary representation (or a Matrix wallpaper).
Which is cool, don't get me wrong.
I can't do it.
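Python can do it for me, though. Here is a minimal sketch, assuming the "binary" is written as groups of 8 bits separated by spaces and that the agreement is ASCII:

```python
bits = "01101111 01111001 01101111"

# Each group of 8 bits is one number between 0 and 255...
values = [int(chunk, 2) for chunk in bits.split()]
print(values)  # [111, 121, 111]

# ...and ASCII tells us which character each number stands for.
text = bytes(values).decode("ascii")
print(text)  # oyo
```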
Where Python helps experts and tricks beginners
If you create bytes and print them, the Python shell may show you letters in something that looks like a string:
>>> bytes([0b1100001, 0b1100010, 0b1100011, 0b1100100])
b'abcd'
This is a great source of confusion for beginners.
But note the leading b before the quote; this is a subtle clue that the object is not a string, but bytes:
>>> type(b'abcd')
<class 'bytes'>
>>> type('abcd')
<class 'str'>
Wait, we have bytes, a sequence of numbers, and bytes can mean something else than text, so why when I create bytes Python shows me something that looks like text?
Well, it's a convenience feature for professional devs that often work with bytes.
You see, if you have to deal with bytes, you will see protocols that mix ASCII text and other formats. Also, if you are used to it, bytes are easier to read using a hexadecimal notation.
So when Python shows you bytes, it shows everything that can be mapped to a printable ASCII character as such, and anything else using a hexadecimal representation:
>>> bytes([0, 1, 200, 245]) # hex
b'\x00\x01\xc8\xf5'
>>> bytes([44, 45, 46]) # ascii
b',-.'
>>> bytes([0, 1, 44, 200, 45, 245, 46]) # both
b'\x00\x01,\xc8-\xf5.'
>>> 0xf5 # F5 is 245 in hex
245
Yeah, I know, it doesn't look like it's easier, but it's an acquired taste.
Python may show you total gibberish, but an experienced low level dev will see the numbers behind it.
It's also a nice shortcut. It's easier to type b"abcd" than bytes([0b1100001, 0b1100010, 0b1100011, 0b1100100]). I only used the latter to make a point at the start of the article.
But yes, if you don't pay attention, you may be getting the wrong impression.
The point is, here, we don’t use a number to represent text, we use text to represent a number:
>>> b"\xf5"[0] == 245 == 0xF5 # different notations, same value
True
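If the mixed notation bothers you, you can ask for another view of the same bytes. A small sketch using only built-in methods of bytes:

```python
data = bytes([0, 1, 44, 200, 45, 245, 46])

print(data)        # b'\x00\x01,\xc8-\xf5.'  (ASCII where possible, hex otherwise)
print(data.hex())  # 00012cc82df52e          (hex everywhere)
print(list(data))  # [0, 1, 44, 200, 45, 245, 46]

# All those notations describe the same values.
assert bytes.fromhex("00012cc82df52e") == data
assert b"abcd" == bytes([97, 98, 99, 100])
```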
A few recommendations when dealing with text
Be explicit when you read and write text files: specify the encoding (e.g. open(path, encoding=...)). If you don't know which one to use, "utf8" is a good start.
If you create text data, make sure you document which encoding it uses somewhere.
Don't mention "raw text" or "pure text", it doesn't mean anything. And certainly don't use the term "ANSI text", which has no well-defined meaning. Communicate the encoding. If it's only using characters from Python's string.printable ("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ \t\n\r\x0b\x0c"), it's safe to call it "ASCII text".
"ASCII" and "ANSI" look similar, but the first one is helpful, while the second is completely different, and confusing.
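Those recommendations could look like this in practice. It's only a sketch: is_printable_ascii is a hypothetical helper I made up for the example, and the file lives in a temporary directory.

```python
import os
import string
import tempfile


def is_printable_ascii(text):
    """Hypothetical helper: True if text only uses string.printable characters."""
    return all(char in string.printable for char in text)


path = os.path.join(tempfile.mkdtemp(), "dessert.txt")

# Always state the encoding, both when writing...
with open(path, "w", encoding="utf8") as f:
    f.write("crème brûlée")

# ...and when reading the file back.
with open(path, encoding="utf8") as f:
    content = f.read()

print(content == "crème brûlée")       # True
print(is_printable_ascii("oyo oyo!"))  # True
print(is_printable_ascii(content))     # False
```

If the helper returns True, calling the data "ASCII text" is accurate. Otherwise, name the encoding.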
I imagine many older IT folk are squirming a bit. In the old days, operating systems actually provided distinct interfaces for dealing with a few different kinds of files. If you tried to open a file with the wrong interface, your application would fail, often with no opportunity to recover. Unix systems introduced a great simplification where all I/O devices were presented as a file (or file-like) object in the file tree. You used a common, simple API: open or close, read or write, seek or tell, and control or status. Life became much simpler for developers in some ways, and more complex in others. We had to design, and hopefully standardize, the file formats mentioned in the article to store audio, video, graphics, etc.
The language run-time libraries provide the ability to use such an entity as a line-oriented text file, a record-oriented FORTRAN file, a block-oriented COBOL file or just a plain stream of bytes. The *nix OS directly provides the plain stream of bytes, with all the other formats built on top of that. Probably the most common is the line-oriented text file. You ask the run-time to give you lines of bytes which are separated by newline characters, and you ask the run-time to write lines of bytes separated by newline characters. Life was good.
Then in the 1980s along came the IBM PC and the Apple Macintosh. PC DOS used a carriage return and a newline, and Mac OS used just a carriage return, to separate lines in "text" files. By this time, the internet (but not yet as we know it now) linked thousands of computers together, and facilitated the transfer of data (often via files). Interop was a big buzzword and selling feature: it was very important for computers to talk to each other. And that meant that the new microcomputers had to talk to the rest of the world, generally by exchanging files. That is when the distinction between binary and text files became critical. Our apps had to know which of three conventions was used to separate lines in the text files.
Modern versions of Python have a feature called universal newlines. When you open a file as a text file, e.g. with mode='rt' (text mode is the default), the Python run-time library will recognize whichever newline convention the file uses and properly separate the byte stream into lines for you. And when you write to a text file, the run-time will use the usual line separator character(s) for the OS between the lines you write. When you open a text file in binary mode, you may need to perform this discovery yourself, and check which OS you are running on to determine the default line separator.
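A minimal sketch of universal newlines at work, assuming a small file that mixes all three conventions (the file name and its content are made up for the example):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "mixed.txt")

# Write raw bytes so that nothing touches the line separators.
with open(path, "wb") as f:
    f.write(b"unix\nmac\rdos\r\n")

# Default text mode: every convention is translated to "\n".
with open(path, encoding="ascii") as f:
    translated = f.read()

# newline="" still decodes the text but leaves separators untouched.
with open(path, encoding="ascii", newline="") as f:
    untouched = f.read()

print(repr(translated))  # 'unix\nmac\ndos\n'
print(repr(untouched))   # 'unix\nmac\rdos\r\n'
```

The bytes on disk never change. Only the way the run-time presents them to you does.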
Under the hood, so to speak, the Python run-time is accessing a binary stream of bytes. It provides convenient services that depend on the file mode to make life easier for all Python developers. All we need to know is how we want to use the file. And hopefully that matches the content of the file.