In this series of posts we will examine exactly how data is stored by a computer. In my previous post, we looked at the world of numbers and how they can be stored and worked upon in binary, including the very large and the very small thanks to the magic of floating point. In this post, we will look at how a computer is able to store and represent text – once again, using just ones and zeros.
In computer terminology, a piece of text is more commonly referred to as a ‘string’. As much as anything else, this helps avoid any assumptions about the content, because strings can include any characters: not only the letters A-Z (in both lower and upper case), but also the digits 0-9, punctuation marks, spaces, tabs, line breaks and other ‘special characters’. A string can be as short as a single character or longer than War and Peace. Ultimately, however, every string can be broken down into its individual characters, each of which is stored as one or more bytes, and each byte is nothing more than a sequence of eight bits, every one of them either 1 or 0. But how does that work in practice?
In order for a computer to store an individual character, for example the upper case letter ‘H’, it needs to first convert it into a byte; but as we learnt previously, a byte is nothing more than a series of 1 or 0 bits that make up an 8-bit binary number, and our letter ‘H’ is not a number of any kind. As we saw with hexadecimal, however, a binary number can be represented as something else (in hex, as two characters drawn from 0-9 and A-F). Similarly, it is also possible to represent a letter as a number, which can then be stored as binary.
Representing letters (and other characters, like punctuation marks, spaces and symbols) as numbers is called ‘encoding’, and it requires the use of something called a ‘character set’, which is really nothing more than a lookup table. One of the oldest character sets is called ASCII (American Standard Code for Information Interchange), which provides a lookup for 128 different characters, including the letters A-Z in both lower and upper case. Historically, the ASCII character set was loaded into memory by the computer from a ROM (read-only memory) chip so that it could be accessed whenever it was given an instruction to store or retrieve string data. Nowadays ASCII, along with a whole bunch of other character sets, is part and parcel of any operating system (Windows, Linux, macOS etc.), but it is still used in the same fundamental way.
Going back to our original example, the letter ‘H’ is encoded via the ASCII character set as the integer 72, which in binary becomes 01001000, and that can be stored as a single byte by the computer. When a user wants to retrieve that byte from memory, then provided the program retrieving it knows to use the ASCII character set, the byte 01001000 will be translated so that it appears on screen as ‘H’. Without the character set, however, the computer would have nothing to go on but the number 72, which is part of the reason why character sets are so important.
Remember that a string could include numeric characters as well, leading to the confusing notion that the number 49 is used in ASCII to represent the numeric character ‘1’. Without employing the correct character set, you can see how things could go very wrong very quickly!
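To make the lookup concrete, here is a minimal sketch in Python (my choice of language for illustration; the helper names `char_to_bits` and `bits_to_char` are made up for this example). It relies on Python's built-in `ord` and `chr`, which agree with the ASCII table for the first 128 characters.

```python
def char_to_bits(ch: str) -> str:
    """Look up the character's number and show it as an 8-bit binary string."""
    code = ord(ch)               # 'H' -> 72
    return format(code, "08b")   # 72 -> '01001000'


def bits_to_char(bits: str) -> str:
    """Reverse the process: binary string -> number -> character."""
    return chr(int(bits, 2))


print(ord("H"))                  # 72
print(char_to_bits("H"))         # 01001000
print(bits_to_char("01001000"))  # H

# The numeric character '1' is not the number 1: its ASCII code is 49.
print(ord("1"))                  # 49
```

The last line shows the trap described above: the byte that stores the character ‘1’ holds the value 49, so a program has to know whether it is looking at text or at a number.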
The ASCII character set has a long pedigree dating back to the 1960s, originally having been derived from a lookup used for sending telegraph messages. It is limited, however, because it can only support up to 128 different characters. This is because ASCII only uses 7 bits of the total 8 bits available in one byte. Some later character sets, such as the Windows code pages commonly (if loosely) referred to as ‘ANSI’, have extended the original ASCII character set to 256 different characters (by employing the 8th bit), since as we learnt with numbers, one byte can be used to store up to 256 different numbers. This provides greater scope for including accented letters (e.g. ö, ç, ù etc.) in order to support text in most European languages.
However, languages like Chinese which employ ideograms rather than letters need many thousands of different characters, far more than 256. At this point, poor old ASCII, extended or not, is no help to us. However, the joy of bytes is that their combined capacity rises exponentially: if we use two bytes to store each character, we can have 65,536 different characters (for more on why this is the case, see here). The Unicode standard takes this idea further still, defining a single character set with room for over a million characters. Its encodings, of which UTF-8 is the most commonly used on the Internet, store each character using one or more bytes; UTF-8 uses up to four bytes for a single character.
Of course, using additional bytes increases the storage size needed for text-based data, but UTF-8 neatly sidesteps this problem by using something called ‘variable width’ character encoding. That is, for the first 128 characters, which are the same as they are in ASCII, it uses just a single byte. For higher characters, it expands by one, two or three additional bytes depending on the character required. This avoids having to use four bytes for every character, while maintaining the flexibility of supporting a much wider variety of characters than ASCII is capable of.
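The capacities and the variable width behaviour can both be checked with a few lines of Python. This is only a sketch: the helper name `utf8_length` and the four sample characters are my own choices, picked so that each needs a different number of bytes.

```python
# The number of distinct values that fit in n bits is 2 to the power n.
print(2 ** 7)    # 128   (7-bit ASCII)
print(2 ** 8)    # 256   (one full byte)
print(2 ** 16)   # 65536 (two bytes)


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs to store a single character."""
    return len(ch.encode("utf-8"))


samples = [
    "H",           # Latin capital H, in the ASCII range
    "\u00f6",      # ö, an accented Latin letter
    "\u4e2d",      # 中, a Chinese ideogram
    "\U0001f600",  # 😀, an emoji beyond the first 65,536 code points
]

for ch in samples:
    print(f"U+{ord(ch):04X} needs {utf8_length(ch)} byte(s)")
```

Running this reports 1, 2, 3 and 4 bytes respectively, which is the ‘variable width’ idea in action.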
Using Windows Notepad, you can actually see the difference in how string data is encoded and stored by a computer. If you create an empty text file in Notepad, you will see that it is 0 bytes in size.
This is because a file, even one with a specified name, is nothing more than a pointer to a storage location until the file has content. While the file remains empty, it is just a pointer to a location that contains 0 bytes of data. If you now open the file, type in a single letter (in this example, an upper case ‘H’) and save it, the file size increases to exactly 1 byte. You will have to check the file properties to see this, however, because Windows Explorer rounds up to the nearest kilobyte!
In hex, the file would look like this:
&48
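If you do not have Notepad to hand, the same experiment can be approximated in Python. This is a sketch under a couple of assumptions: the function name `sizes_before_and_after` and the file name `example.txt` are invented for the example, and the file is created in a temporary folder so nothing is left behind.

```python
import os
import tempfile


def sizes_before_and_after(text: str, encoding: str = "ascii"):
    """Return (size of the empty file, size after writing text), in bytes."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "example.txt")  # hypothetical file name

        # Create the file without writing anything to it.
        with open(path, "wb"):
            pass
        empty_size = os.path.getsize(path)

        # Overwrite it with the encoded text and measure again.
        with open(path, "wb") as f:
            f.write(text.encode(encoding))
        return empty_size, os.path.getsize(path)


print(sizes_before_and_after("H"))  # (0, 1)
```

Writing the bytes ourselves (rather than letting a text editor do it) makes it clear that the one byte on disk is simply the encoded letter.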
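Now save the file under a different file name, but in the ‘Save As’ dialogue box, choose ‘UTF-8’ encoding. (Recent versions of Notepad call this option ‘UTF-8 with BOM’; their plain ‘UTF-8’ option omits the three extra bytes described below.)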
If you check the file properties for this new file, you will see that it has grown to 4 bytes. In hex, the file now looks like this:
&EF &BB &BF &48
What is occupying the additional 3 bytes? The answer is something called a ‘byte order mark’ (BOM), which is a standard three-byte sequence telling any program that opens this file that it has been encoded in UTF-8. The same three bytes (in hex: &EF &BB &BF) will appear at the start of any text file saved with this option by programs like Windows Notepad. It is not a requirement that something encoded in UTF-8 should have this byte order mark at the start, and many programs do omit it, including many web-based applications. However, a program that expects a byte order mark and does not find one may assume (possibly incorrectly!) that the text has been encoded in ASCII or ANSI, because neither of those ever has a byte order mark.
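Python happens to offer both flavours of UTF-8 as separate codecs, which makes the byte order mark easy to see. In this sketch, `utf-8-sig` is Python's name for ‘UTF-8 with a BOM’, and the `to_hex` helper is my own, written to match the &XX notation used in this post.

```python
import codecs


def to_hex(data: bytes) -> str:
    """Format bytes in the &XX style used in this post."""
    return " ".join(f"&{b:02X}" for b in data)


with_bom = "H".encode("utf-8-sig")   # prepends the three-byte BOM
without_bom = "H".encode("utf-8")    # plain UTF-8, no BOM

print(to_hex(with_bom))      # &EF &BB &BF &48
print(to_hex(without_bom))   # &48

# The standard library exposes the BOM itself as a constant.
print(with_bom.startswith(codecs.BOM_UTF8))   # True

# Decoding with 'utf-8-sig' strips the BOM again if it is present.
print(with_bom.decode("utf-8-sig"))           # H
```

Decoding the BOM-prefixed bytes with plain `utf-8` would not fail, but it would leave an invisible extra character at the start of the string, which is one reason programs need to agree on whether a BOM is expected.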
If you now add just one more letter to the file and continue saving it using UTF-8 encoding, it will grow by just one further byte:
&EF &BB &BF &48 &41
This is because, as we discussed above, UTF-8 is a variable width character encoding, and the additional character used here (the letter ‘A’) does not fall outside the first 128 characters. We can test this theory further by adding a third character to the file, this time from the Cyrillic alphabet. The character Ю (upper case ‘Yu’) is not part of the first 128 characters, but it is within the first 2,048 (its Unicode code point is U+042E, or 1,070 in decimal), which is the range that UTF-8 stores in two bytes. Sure enough, when we check the file properties now, the file has grown from 5 bytes to 7 bytes:
In hex, you can see that the one additional character has given us two more bytes (&D0 and &AE):
&EF &BB &BF &48 &41 &D0 &AE
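Where do &D0 and &AE come from? For code points in the two-byte range, UTF-8 splits the number into its top five bits and its bottom six bits and wraps each in a fixed prefix. The sketch below does this by hand for Ю and checks it against Python's own encoder; `encode_two_byte` and `to_hex` are illustrative helpers, not part of any library.

```python
def to_hex(data: bytes) -> str:
    """Format bytes in the &XX style used in this post."""
    return " ".join(f"&{b:02X}" for b in data)


def encode_two_byte(ch: str) -> bytes:
    """Hand-build the UTF-8 bytes for a character in the two-byte range."""
    code_point = ord(ch)
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("character is outside the two-byte UTF-8 range")
    first = 0b11000000 | (code_point >> 6)          # 110xxxxx: top 5 bits
    second = 0b10000000 | (code_point & 0b111111)   # 10xxxxxx: bottom 6 bits
    return bytes([first, second])


yu = "\u042e"  # Ю, Cyrillic capital letter Yu

print(to_hex(encode_two_byte(yu)))                 # &D0 &AE
print(encode_two_byte(yu) == yu.encode("utf-8"))   # True

# The whole file: BOM, then 'H', 'A' and the two bytes for Ю.
file_bytes = ("HA" + yu).encode("utf-8-sig")
print(len(file_bytes))      # 7
print(to_hex(file_bytes))   # &EF &BB &BF &48 &41 &D0 &AE
```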
What would happen if we now save this same file using ANSI? If you try it, you will receive a warning from Notepad stating that the file contains characters in Unicode format which will be lost if you continue. Click ‘OK’ anyway, and then have a look at the file properties… the file has shrunk to 3 bytes!
In hex, the file now looks like this:
&48 &41 &3F
This is because the leading three bytes of the byte order mark have been removed (remember, neither ASCII nor ANSI uses a byte order mark) and the two bytes (&D0 &AE) of our Cyrillic letter have been replaced by a single byte, &3F. ANSI stores every character in exactly one byte, and on a system using a Western European code page none of the 256 available values is assigned to Ю, so Notepad substitutes a question mark (in hex: &3F) instead. If you open the file, you will see ‘HA?’. The substitution is permanent: the file now contains a genuine question mark, and there is no way to tell from its contents what the original character was.
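The same loss can be reproduced in Python. This sketch assumes that ‘ANSI’ on the machine in question means the Western European code page Windows-1252 (`cp1252` in Python); on a system configured for a different code page the result could differ. The function name `save_as_ansi` is invented for the example.

```python
text = "HA\u042e"  # 'HA' followed by Ю


def save_as_ansi(s: str) -> bytes:
    """Encode as Windows-1252, swapping unsupported characters for '?'."""
    return s.encode("cp1252", errors="replace")


ansi_bytes = save_as_ansi(text)

print(len(ansi_bytes))                              # 3
print(" ".join(f"&{b:02X}" for b in ansi_bytes))    # &48 &41 &3F
print(ansi_bytes.decode("cp1252"))                  # HA?

# Without errors="replace", Python refuses rather than losing data silently.
try:
    text.encode("cp1252")
except UnicodeEncodeError as error:
    print("cannot encode:", error.reason)
```

Reading the bytes back gives ‘HA?’, and nothing in them records that the question mark used to be something else.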
Incidentally, if you want to look at how files actually appear to a computer in both hex and in binary, it can be fun (for a given value of ‘fun’…) to use a hex editor. These tools let you view files in their hex representation. HxD is a good, free hex editor that I like to use, and can be downloaded here.
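If you would rather not install anything, a very rough stand-in for a hex editor takes only a few lines of Python. This is a toy sketch rather than a replacement for a real tool; the `dump` function and its column layout are my own invention.

```python
def dump(data: bytes) -> str:
    """Show each byte's position, hex value, binary value and ASCII character."""
    lines = []
    for offset, byte in enumerate(data):
        # Only printable ASCII is shown as-is; everything else becomes a dot.
        shown = chr(byte) if 32 <= byte < 127 else "."
        lines.append(f"{offset:04d}  &{byte:02X}  {byte:08b}  {shown}")
    return "\n".join(lines)


print(dump("HA\u042e".encode("utf-8")))
# 0000  &48  01001000  H
# 0001  &41  01000001  A
# 0002  &D0  11010000  .
# 0003  &AE  10101110  .
```

To inspect a real file, the bytes can be read with `open(path, "rb").read()` and passed to `dump` in the same way.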
This hopefully demonstrates both how character sets work in practice, and their importance to a computer when storing text-based data. In the next post, we will explore dates…