MicroProject: Pyfer

A Simple Cipher Algorithm Built in Python

Elby
12 min readOct 2, 2020

I don’t think I’m exaggerating when I say that one of the worst things in the world is having to scroll through an entire life story before you finally get to the actual instructions of a recipe. I therefore apologise in advance for what follows.

Most of my projects have a similar origin story. Generally, they begin when my interest is piqued after reading a book or an article somewhere (by which I mean falling into a Wiki-hole), and I find myself entertaining a thought of the form: “I wonder if I could… do that / …build that from scratch / …improve that” and so on. I like to call this the “IWIC” moment — the moment at which curiosity turns into a challenge. What follows is typically a cycle of headaches, pacing, and increasingly refined Google searches; but every step, however small, is accompanied by a satisfying little knowledge nugget. I find this to be what is so immensely gratifying about coding and, without wanting to get too existential here, life in general (solving problems, that is — not nuggets).

IWIC moments are why I decided to learn to code in the first place. Before that, I spent large amounts of (company) time working on pet projects using Excel, one of which was the first version of what would eventually become Pyfer.

End of autobiography.

Pyfer: What It Is and How It Works

Pyfer is a simple cipher algorithm built in Python with an extremely clever and to-the-point name. Its primary use (its only use, really) is to encrypt or decrypt a message by substituting letters, numbers, and symbols, drawn from a scrambled grid of characters, for the characters in the message. A readable message (also known as plaintext) is fed into Pyfer, resulting in a jumbled mess (ciphertext) that requires key in order to be deciphered. In cryptography-parlance, Pyfer is a combination of symmetric classical transposition and monoalphabetic substitution ciphers, which doesn’t mean much other than it is best explained using an example.

Let’s say we want to encrypt the following message:

'Hello_world!'

There are lots of algorithms we could use, some of which have been around for thousands of years. In general, they all require the following three things in some form:

1. a message to encrypt

2. an encryption machine

3. an encryption key

Having decided what our message will be, we now need an encryption machine and an encryption key.

An encryption machine is a mechanism by which the plaintext message is converted into ciphertext. This is the actual algorithm behind the encryption; the sequence of instructions to follow in order to turn the readable message into an unreadable bowl of mashed potatoes. While the encryption machine is more of a process than an actual machine, most modern encryption algorithms are so complex that they can only be executed by powerful computers. These algorithms form the basis of what is known as “strong cryptography” and are differentiated from “classical” encryption algorithms by the fact that they cannot be worked out by hand. Pyfer uses a classical algorithm, although it is fairly long-winded, as we shall see below. The encryption key is an independent and variable piece of information, usually in the form of a string of letters, digits, and/or other characters, that tells the machine how to configure itself. In other words, the key tells the machine which specific bowl of mashed potatoes to make.

The mark of quality for an encryption algorithm is its ability to withstand attack by cryptanalysis, i.e. how difficult it is to crack the code. Because classical algorithms are simple enough to be worked out by hand, they are generally quite easy to crack. Simple cracking techniques include brute force attacks, where the attacker tries every possible key until one of them works; and frequency analysis, which seeks out character patterns in the plaintext that show up in the ciphertext, thereby eliminating the need for a key. There are, of course, far more sophisticated cracking techniques available today, but we need not concern ourselves with them here (for more information, I thoroughly recommend Simon Singh’s The Code Book). The goal with Pyfer was to create an encryption algorithm simple enough to be worked out by hand, but complex enough to withstand both brute force attacks and frequency analysis. Here’s how it works…

Step 1: The Key

We have our message and our machine (the Pyfer machine), now we just need a key. The following randomly generated key will do:

'591859582504828990688327803012971805820837330'

Step 2: Sub-Key Generation

Next, we split the key into five parts. For each of these five parts, a new sequence of numbers (a “sub-key”) is obtained by taking the position of each number, relative to the key-part, that would sort the key-part in ascending order. For instance, if we wanted to sort the first key-part in ascending order, we would start with the 1 (since there are no 0s), which is in position 2 — Queue horrible record scratch noise! Don’t you mean position 3, you plum? Well, normally it would be position 3, but Python uses what is called zero-based indexing, meaning the first item in a list is actually in position 0 rather than position 1 — so the first number in the sub-key will be 2. Next, we look for any instances of the number 2 in the key-part, and we find one in position 8, so we add 8 to the sub-key. The next largest number in the key-part is a 5, of which there are three; in positions 0, 4, and 6. We add 0, 4, and 6 to the sub-key, and so on. Eventually we end up with five new sequences of numbers:

This step, while difficult to explain, is rather straightforward in Python (demonstrating the magic that is NumPy):

key_x_elements = []for i in key[0:9]:     key_x_elements.append(int(i))     x_key = np.argsort(np.array(key_x_elements))

And so on for y_key, x2_key, y2_key, and z_key. (For the remainder of the illustration, a code snippet will be included at each step.)

Step 3: Grid Shuffle

Within our machine we have a grid containing all of the characters we can use in our encryption:

The grid contains lowercase letters, uppercase letters, digits, and 19 symbols, and it is the definitive list of characters. This means that the Pyfer machine cannot encrypt messages that contain characters that do not appear in the grid.

The grid is arranged as a collection of nine rows and nine columns (again, starting at 0 rather than 1). If we use the x_key sub-key as a guide, we can rearrange the columns like so:

reshuffle_1 = char_grid[:, x_key]

The shuffled grid is then flattened out into a 3-by-27 shape, and then transposed along the diagonal, resulting in a 27-by-3 grid:

reshuffle_2 = reshuffle_1.reshape(3, int((9 ** 2) / 3)).transpose()

The 27-by-3 grid is reshaped back into a 9-by-9 grid:

reshuffle_3 = reshuffle_2.reshape(9, 9)

Finally, as for the first reshuffle, the grid is reorganised according to the sub-key. This time, the rows are rearranged in the order given by y_key:

reshuffle_4 = reshuffle_3[y_key, :]

This gives us a character grid that is somewhat scrambled, but not quite scrambled enough. Why not do the whole thing again? This time using the x2_key and y2_key sub-keys:

reshuffle_5 = reshuffle_4[:, x2_key]
reshuffle_6 = reshuffle_5.reshape(3, int((9 ** 2) / 3)).transpose()
reshuffle_7 = reshuffle_6.reshape(9, 9)
reshuffle_8 = reshuffle_7[y2_key, :]

Because we have nothing better to do, let’s repeat the entire process with the z_key sub-key (for both row and column reordering):

reshuffle_9 = reshuffle_8[:, z_key]reshuffle_10 = reshuffle_9.reshape(3, int((9 ** 2) / 3)).transpose()reshuffle_11 = reshuffle_10.reshape(9, 9)reshuffle_12 = reshuffle_11[z_key, :]

Golly, that’s a beautifully scrambled character grid, I hear you say. Well thank you, anonymous readership statistic!

Step 4: Transposition and Substitution

We can now use the scrambled grid to encrypt our message. We do this by identifying the coordinate pairs in the scrambled grid for each character in our message. The coordinates form a 2-by-n array (where n is the number of characters in the message), which is then transposed along the diagonal, forming an n-by-2 array, before finally being reshaped back into a 2-by-n array. We then use the resulting array as a list of coordinates from which to look up characters in the scrambled grid and substitute them for the original characters:

And there we have it! The output obtained from the substitution is the encrypted message or ciphertext: 'MjyhE/KX=dXF'. In Python, the whole thing looks like this:

In: from pyfer import cryptIn: crypt.Machine('591859582504828990688327803012971805820837330').scramble('Hello_world!')Out: 'MjyhE/KX=dXF'

And of course, we can also decrypt messages:

In: crypt.Machine('278783118045432772940696442119590650027167169').unscramble('Z*gb+K.C75olz9Vkqq')Out: 'Well_howdie_there!'

And that, as they say, is that.

Note that the above code requires Pyfer to be installed. You can do so by running pip install pyfer in the command line.

Encryption Modes

If we were to scramble the very same message with a different key, we would obtain a different result. For example:

'191710526217107479125403646661927934823045548' gives 'F8)u8HaHjXi,'

'471343254537934916226523674566968223422225547' gives 'TOLwzCX5poNP'

All of these will draw characters from a scrambled version of the original character grid, meaning that they are likely to contain lowercase and uppercase letters, digits, and symbols. Now, the situation might require that the ciphertext contain certain character types, letters and numbers only, for example. In this case, we can change the encryption mode so that the machine only accepts and returns a limited set of characters, and we can do so simply by changing the length of the key.

Pyfer has three encryption modes, each drawing from a different character set. In the above example, we used a key that was 45 digits long, and the character grid contained the full set of available characters. This was an instance of Pyfer’s “large” mode. But we can also encrypt messages in “small” or “medium” mode just by using a key of 30 or 40 digits, respectively, and leaving Pyfer to infer from the length of the key which mode it is supposed to use.

Let us illustrate this by encrypting a different message, 'i7am7a7message7in7a7bottle', in all three modes.

Small Mode (Mode 1):

· Key: '765677529163444907187562266626' (30 digits)

· Output: 'iry7dxf78sbht7vpmbopiggo4a'

Medium Mode (Mode 2):

· Key: '7656775291634449071875622666261778265348'(40 digits)

· Output: 'iZs9qsgsfib0M7iCFTaeFjzoHt'

Large Mode (Mode 3):

· Key: '765677529163444907187562266626177826534812128'(45 digits)

· Output: 'oa8pA8qXhagtufm*4jhq1>7ht('

We see that the set of characters included in the resulting ciphertext is different in each mode. Specifically, Small Mode uses only lowercase letters and digits; Medium Mode uses lowercase and uppercase letters, digits, exclamation marks, and question marks; and Large Mode uses all the characters that Medium Mode does, as well as 17 other symbols.

Photo by Ryan Hutton on Unsplash

Upsides and Downsides

As I mentioned above, the point of this MicroProject was to build a cipher algorithm in Python that was strong enough to withstand simple cryptanalysis, while being simple enough to execute by hand. While the process is somewhat involved, it is certainly executable by hand. But does it stand up to brute force attacks and frequency analysis?

In Step 2 of the process, the key is broken down into five sub-keys of nine unique numbers from the range 0 to 8. This means that there are around 6.3*10²⁷ possible sub-key configurations. A single run-through of the algorithm takes my computer 655 microseconds to complete, which is not a long time. However, 6.3*10²⁷ is such a hefty number that even if my computer was running continuously it would take more than 130-thousand-trillion years to try every single configuration. Even the world’s fastest supercomputers would need hundreds of years to try all the keys. Not too bad for an algorithm you can work out on a napkin.

The Pyfer algorithm has the additional advantage of not preserving character patterns found in the plaintext, meaning that frequency analysis will not be particularly useful. 'Hello_world!' contains three lowercase l and two lowercase o characters, but the ciphertext 'MjyhE/KX=dXF' does not contain any repeated characters. This means that patterns such as the frequency of occurrence of double letters and common letters are scrambled along with the message itself. We can conclude, therefore, that Pyfer is stronger than the 2000-year-old Caesar Cipher, as well as many other classical ciphers.

That said, Pyfer’s simplicity comes with a few quirks that may make it vulnerable to cryptanalysis. First of all, because of the way the key is transformed into five sub-keys, the number of possible sub-key configurations is considerably smaller than the number of possible keys. There are, as discussed above, approximately 6.3*10²⁷ possible sub-key configurations, which is an enormous number but is not quite as enormous as the number of possible keys, which is 10⁴⁶ — 1 (in Large Mode). This discrepancy means that certain keys will result in exactly the same sub-key configuration. In fact, for each key there are about 160 quadrillion slightly different keys that will result in the same configuration, and therefore the same character grid, and ultimately output the same ciphertext. While this is a dramatic reduction in the amount of work required of a brute force approach, the attacker would still have to try 6.3*10²⁷ unique sub-key configurations which, as we have seen, would take rather a long time. This quirk, therefore, is not really much of a problem at all.

A more pressing issue might be how Pyfer deals with repeating characters. In the final transposition and substitution step, Pyfer looks up the coordinates of each character in the plaintext and transposes them, such that the coordinates of the output characters depend on those of neighbouring characters. Each character in the ciphertext, therefore, is a product of a pair of characters in the plaintext. This is why Pyfer stands up to frequency analysis, but it also means that if plaintext characters are repeated in sequence more than once, the output will also contain character repetition. For example, using the key '278783118045432772940696442119590650027167169', the plaintext 'aa' is scrambled into the ciphertext 'tS', and the plaintext 'aaaa' is scrambled into 'ttSS', meaning that there is some noticeable pattern preservation. This, however, is not as much of an issue as it may seem to be, because the pattern itself never actually emerges. Using the same key, the plaintexts 'ab', 'abab', and 'aabb' would result in 'k9', 'kk99', and 'toS>', respectively. So, while the resulting ciphertext tells us that something is being repeated, it does not give us any clue about what that something is. And so, character patterns, just like brute force attacks, are nothing to fax home about.

Conclusion

Does this mean that Pyfer is an unbreakable code system that is being used by the CIA/FSB/MI6/KFC? The short answer: no. The long answer: unfortunately, no. Sorry. There is a reason that serious encryption algorithms are not executable by hand (or even on Excel), and that reason is that they have to be complex enough to withstand attacks that are far more sophisticated than brute force or frequency analysis. Modern cryptanalysis employs a range of extremely powerful cracking techniques against which Pyfer doesn’t stand a chance. Unfortunately (read: conveniently), these are beyond the scope of this blog post.

While Pyfer is not even remotely in the same league as modern cryptographic techniques, it performs reasonably well in most cases and meets the challenges posed by the more basic types of attack, while being simple enough to work out by hand. Ultimately, it is something that should only be used for the same reason it was built — for fun.

The source code for Pyfer, including demos and documentation, can be found in the GitHub repository. For additional information, feel free to contact info@elbydata.com.

--

--