Most of us are familiar with the concept of data encryption: masking meaningful data so that it can’t be understood by anyone besides the person or machine it is intended for. It’s what makes secure data transactions on the internet possible, and it underpins any system that provides secure access using a password. We provide a password and the system we are attempting to access verifies that password before letting us in.
It seems intuitive that anything which gets encrypted can later be decrypted, but in fact this is very often not the case. Indeed, the password example above is a case in point. The password we supply when we create an online account is stored in encrypted form in the database of the system that manages that account, and once encrypted it is never decrypted again. This, incidentally, is why if you forget your password, the system generally resets the account and asks you to verify your identity in some other way (e.g. via an e-mail link or SMS message) before asking you to set a new password. At all times, only you know the true password value, while the system itself remains ignorant. So how does it all work?
The technique used for storing passwords in the vast majority of databases is known as hashing. A ‘hash’, sometimes known as a digest, is a fixed-length alphanumeric representation of the original piece of data, produced by following the rules of a particular algorithm. Common hash algorithms include SHA-1, SHA-256 and MD5. The central principle of hashing is that while an algorithm can turn a piece of data into a hash, it cannot then reverse that process and turn the hash back into the original piece of data again. This makes hashes well suited to password storage, because even if someone gains access to a database of user accounts, the hashed password values won’t mean anything and can’t be decrypted, even if the hacker knows the algorithm that was used to create the hash in the first place.
This is a difficult concept to grasp, because intuitively if we know the rules used to create a hash, it ought to be possible to follow those rules in reverse to deconstruct it. It is worth using a simple example to explain why this isn’t so. Imagine a system that requires users to provide a prime number as their password. Let’s say my password is ‘1367’. The hashing algorithm – we’ll call it the PRIME-1 hash – follows a simple rule set: it takes the original value, squares it and then multiplies it by the last digit of the resulting square (unless the last digit is zero, in which case it multiplies it by 10). So:
- 1367 * 1367 = 1868689
- 1868689 * 9 = 16818201
The PRIME-1 hash of my original ‘1367’ password is now ‘16818201’, which is what the system stores in its database. If someone else hacks the database, finds the value ‘16818201’, and even knows the rules of the PRIME-1 hash, they can’t simply run the steps backwards. Why? Well, for starters, they don’t know what the multiplier was for step 2, which is where they’d have to start, as they would need to divide by that multiplier to get to the result of step 1. Admittedly, this particular hash is easy to break by trial and error. You could try dividing ‘16818201’ by each of 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10, take the square root of each result, and stop when you get a whole number that is prime. Only ten candidates need checking, so a computer would finish almost instantly. Such attempts to break hashes by systematically trying possibilities are known as “brute force” attacks, and they can be successful given sufficient time and computing resource. The PRIME-1 hash is a very weak hash, so a brute force attack works. Hashes like SHA-256 are far stronger: to date, nobody has published a practical way to reverse them. That’s because the rule set they use to produce the hash is so complex and involves so many steps that the computing power required to break it would be immense, and certainly beyond the capabilities of your average hacker.
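The PRIME-1 example can be sketched in a few lines of Python. This is only an illustration of the toy rule set described above: the function names (`prime1_hash`, `is_prime`, `reverse_prime1`) are made up for this sketch, and the reversal simply tries each of the ten possible multipliers in turn.

```python
import math


def prime1_hash(password: int) -> int:
    """Toy PRIME-1 hash: square the input, then multiply the square
    by its own last digit (or by 10 if that digit is zero)."""
    square = password * password
    last_digit = square % 10
    multiplier = 10 if last_digit == 0 else last_digit
    return square * multiplier


def is_prime(n: int) -> bool:
    """Trial division; fine for the small numbers used here."""
    if n < 2:
        return False
    for divisor in range(2, math.isqrt(n) + 1):
        if n % divisor == 0:
            return False
    return True


def reverse_prime1(digest: int) -> list[int]:
    """Brute force: try every possible multiplier and keep the prime
    inputs that hash back to the same digest."""
    candidates = []
    for multiplier in range(1, 11):
        if digest % multiplier != 0:
            continue
        square = digest // multiplier
        root = math.isqrt(square)
        # The quotient must be a perfect square of a prime password.
        if root * root == square and is_prime(root):
            if prime1_hash(root) == digest:
                candidates.append(root)
    return candidates


print(prime1_hash(1367))         # 16818201
print(reverse_prime1(16818201))  # [1367]
```

Incidentally, the search also stumbles on 4101, whose square is 16818201 and which therefore produces the same PRIME-1 hash. It is rejected only because 4101 (3 × 1367) is not a prime number.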
If a hash can’t be decrypted, how does the system use it to verify an unhashed password when a user attempts to log in? Well, one of the features of a hash algorithm is that the same input value will always result in the same hash output value. So if your password value is ‘password’ and the SHA-1 algorithm is used, the hash of ‘password’ will always be ‘5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8’. That means that when you log in, you supply the string ‘password’, the system hashes it and then compares the result to the hash it has stored in its database for your account. If the hashes match, the login attempt is a success. Another intriguing feature of hashes is that even the smallest variation in the input value results in an entirely different hash output. Compare these two SHA-1 hashes:
password = 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
Password = 8be3c943b1609fffbfc51aad666d0a04adf83c9d
The only difference is the capitalisation of the ‘P’ in password, but the resulting hashes are completely different, with no visible relationship between them.
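The login check described above can be sketched with Python's standard `hashlib` module. The `stored_hashes` dictionary and the `alice` account are hypothetical stand-ins for a real user database, and the sketch follows the plain SHA-1 example in the text rather than any particular production scheme.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """Return the SHA-1 hash of a string as 40 hexadecimal characters."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Hypothetical account table: only the hash is stored, never the password.
stored_hashes = {"alice": sha1_hex("password")}


def verify_login(username: str, attempt: str) -> bool:
    """Hash the attempted password and compare it with the stored hash."""
    stored = stored_hashes.get(username)
    if stored is None:
        return False
    return sha1_hex(attempt) == stored


print(sha1_hex("password"))               # 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
print(verify_login("alice", "password"))  # True
print(verify_login("alice", "Password"))  # False
```

The system never needs to recover the original password: it only needs to repeat the same one-way calculation and compare the results.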
A further feature of hashes is that they are always the same length, regardless of the length of the input value. Compare these two SHA-1 hashes:
a = 86f7e437faa5a7fce15d1ddcb9eaeaea377667b8
the quick brown fox jumps over the lazy dog = 16312751ef9307c3fd1afbcb993cdc80464ba0f1
In both cases, the hash is exactly 40 characters, even though the input strings are of very different lengths. The hash would still be 40 characters in length even if the input was over 600 characters long:
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only. = 3d40543842c230cd11e702b46ce85038ea320971
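The fixed-length property is easy to check for yourself. The short sketch below, again using Python's `hashlib`, hashes three inputs of very different sizes; the 600-character string of repeated letters is just an arbitrary long input chosen for illustration.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """Return the SHA-1 hash of a string as hexadecimal text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


inputs = [
    "a",
    "the quick brown fox jumps over the lazy dog",
    "x" * 600,  # arbitrary 600-character input
]

for text in inputs:
    digest = sha1_hex(text)
    # Input length varies; digest length does not.
    print(len(text), len(digest), digest)
```

A SHA-1 hash is 160 bits long, which is why it always prints as 40 hexadecimal characters. A SHA-256 hash is 256 bits, or 64 hexadecimal characters.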
It should be clear by this point that a hash is therefore not strictly unique. There is no limit to the number of possible inputs, but only a fixed number of possible 40-character outputs, so two or more inputs can result in the same hash output. This is known as a hash collision. However, the chance of any two given input strings producing the same hash output is vanishingly small: in the case of SHA-1 it is roughly 2^-160. That said, many hackers specialise in attempting to exploit hash algorithms by finding pairs of input strings that produce the same hash output, and from there trying to work out how to disguise data inside hashes that appear to be for some other input. Both the SHA-1 and MD5 hash algorithms have been exploited in this way (SHA-256, to date, has not).
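Finding a collision in full SHA-1 is far beyond a short script, but the idea can be illustrated by deliberately weakening the hash. The sketch below keeps only the first 4 hexadecimal characters of each SHA-1 hash, leaving just 65,536 possible outputs, so a repeat must turn up within 65,537 distinct inputs. The `short_hash` and `find_collision` names and the `input-N` strings are invented for this example.

```python
import hashlib


def short_hash(text: str, hex_chars: int = 4) -> str:
    """Deliberately weakened hash: the first few hex characters of SHA-1."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:hex_chars]


def find_collision(hex_chars: int = 4) -> tuple[str, str, str]:
    """Hash 'input-0', 'input-1', ... until two inputs share a short hash."""
    seen = {}  # short hash -> first input that produced it
    counter = 0
    while True:
        text = f"input-{counter}"
        digest = short_hash(text, hex_chars)
        if digest in seen:
            return seen[digest], text, digest
        seen[digest] = text
        counter += 1


first, second, digest = find_collision()
print(f"{first!r} and {second!r} both hash to {digest}")
```

Every extra character kept from the hash multiplies the number of possible outputs by 16, which is why the same search against all 40 characters is not feasible this way.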
Hashes have uses beyond password storage. They can be used to verify transactions, as in the case of Bitcoin, where ledger entries are hashed using SHA-256 and miners work with those hashes rather than with the original transaction data itself. Hashes can be nested together into hash trees (which are essentially hashes of hashes), and only the top-level hash, often called the ‘root’, is required to validate the tree. This means that a lot of information can be captured in a series of hashes, but only one hash value is required to validate the data set. This reduces the time needed to validate or verify information. Entire files can be reduced to a hash for the purposes of reconciliation or change management (if the last recorded hash value for a file is no longer the same as the current hash value, it indicates the underlying file has changed in some way).
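Both ideas can be sketched briefly in Python. The `merkle_root` function below is a simplified hash tree, not Bitcoin's exact construction: it pairs up SHA-256 hashes level by level, and duplicating the last hash on an odd-sized level is an assumption made for this sketch. The `stream_digest` function shows the change-detection idea, with an in-memory `io.BytesIO` object standing in for an open file. The ledger entries are made-up sample data.

```python
import hashlib
import io


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hash of some bytes as hexadecimal text."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(items: list[bytes]) -> str:
    """Hash each item, then repeatedly hash pairs of hashes until one remains."""
    if not items:
        raise ValueError("need at least one item")
    level = [sha256_hex(item) for item in items]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: repeat the last hash if odd
        level = [
            sha256_hex((level[i] + level[i + 1]).encode("ascii"))
            for i in range(0, len(level), 2)
        ]
    return level[0]


def stream_digest(stream, chunk_size: int = 65536) -> str:
    """Hash a binary stream (such as an open file) one chunk at a time."""
    hasher = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


# Made-up ledger entries: changing any one of them changes the root hash.
entries = [b"alice pays bob 5", b"bob pays carol 2", b"carol pays dave 1"]
root_before = merkle_root(entries)
entries[1] = b"bob pays carol 20"
root_after = merkle_root(entries)
print(root_before != root_after)  # True

# Change detection: compare the recorded hash with the current one.
recorded = stream_digest(io.BytesIO(b"original file contents"))
current = stream_digest(io.BytesIO(b"original file contents, edited"))
print(recorded == current)  # False
```

In both cases a single short value is enough to tell whether anything in the underlying data has changed, without comparing the data itself.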
In short, a hash is a one-way street that allows for the encryption of data without the ability, or even the requirement, to decrypt it back again. The hash acts as a digital placeholder that represents the original input data without actually having any meaningful content of its own.