Saturday, October 29, 2011

Encryption and Ciphers - Part 2


Welcome back (or just "welcome" to the people that didn't read the last entry, if you are one of those, I highly encourage you to atleast skim it since I use a lot of that terminology here). 

So, last time, we discussed the Caesar Cipher  and why it is so easy to be cracked. So, if it is easy exploitable, what's the alternative, well, the next step up would be a polyalphabetic cipher. 

One such famous cipher is called the Vigenere Cipher. To use it, both parties must agree on a passphrase, such as "EYE". Let the converted-to-numbers passphrase be represented by the set $M$. To get the ciphertext (say the plaintext is "ATTACK") , we repeated the set $M$ inside set $N$ until the length of $N$ is the same as the length of the plaintext set $P$, and then:

$P_i + N_i \equiv S_i (\mod{26})$

Lets do an example. With the plaintext "ATTACK", and the keyphrase "EYE", we write down the following:

$P = \{1, 20, 20,1, 3, 11\}$
$N = \{3,25, 3, 3,25, 3\}$

Now, to get the ciphertext in integers ,$S$, we just add them up, modulo 26:

$S = \{4,19, 23,4, 2, 14\}$

Which translates to: "DSWDBN"

Now, why is this better? In the plaintext, we have two T's, but, in the ciphertext, there are no letters that are repeated twice. This means that this cipher does not preserve frequency. Why is that important? That means that one cannot use frequency histograms anymore, because the ciphertext no longer has the same frequencies of letters as the plaintext! 

So, that means that the Vigenere Cipher is more resilient to statistical methods of attack than the Caesar Cipher, but, what about the dictionary/brute-force attack we covered last time? Well, lets see. 

For the Caesar Cipher, we only have to cover 26 shifts, whereas for the Vignere cipher, things are much more complicated. We must consider every possible passphrase, so, if we are given the ciphertext, 
and it is 12 characters long, for example, we must test for all passphrases that are 1 character long, 2 characters long and so on until 12 characters long. So, if we have a ciphertext and it is length is $n$ characters, we see that the number of tries we must do is:

$26^1 + 26^2 + 26^3 + ... + 26^n = \sum_{i=1}^{n}26^i$

Since that's a geometric series,

$\frac{26(1-26^n)}{(1-26)}$

That's a lot, even for a computer. If implement a cracking algorithm, it'll run in exponential time, which is really bad.

But, what happens if you can guess how large the length of the passphrase is (just imagine, for a moment)? That becomes only $26^n$ tries, and it would turn into a sort of modified Caesar Cipher. That's not too bad, right? 

Well, but, how would we guess the length of the passphrase? A clever chap named Friedrich Kasiski figured this one out for us. Its not always effective, but, when it is, it reduces computation time greatly. Basically, repeated words in the plaintext may by chance have the same letters in the passphrase they were encrypted by. Where this is true, frequency is preserved, and we can see how long the passphrase is. Of course, this would only work if there were any repeated words at all, and also if the repeated words coincided exactly with the passphrase used. So, it obviously isn't precise, but, if it does work, it works brilliantly.  

In the next post, we'll take a big jump, and take a look at the Knapsack Cryptosystem. 

As always, feedback is greatly welcomed, whether negative or positive doesn't matter. 

Also, if you guys have any other ideas for articles you would like me to do, please comment below. 

I  haven't updated the github repo (https://github.com/Poincare/cryptoBlog) as of now, so, if someone could do that, that would be awesome (I'll probably do it before the end of today, my git needs reinstalling). 

Friday, October 28, 2011

Encryption and Ciphers - Part 1

This is starting a three or four part series on encryption and ciphers, going from simple ciphers such as the Caesar Cipher to industry strength encryption methods like RSA (maybe even Blowfish if I cover that much stuff). This assumes that you're familiar with atleast a little bit of modular arithmetic (http://en.wikipedia.org/wiki/Modular_arithmetic).

So, lets get started.

Firstly, what is encryption? Let's get a definition down. Encryption is taking "plaintext" (text everyone concerned can read and understand, including people that you would rather wouldn't read your message) and converting it to "ciphertext", which only you and the parties that understand how to go from ciphertext to plaintext (a.k.a the recipient) should understand.

That sounds pretty good, right? 

Now, for some basic knowledge I'll be using throughout the article, so, pay attention. When I say string, I mean a string of characters (which should be self-explanatory for the developers here). When I say convert a string to numbers/integers, I mean that we map "a" to "1", "b" to "2", "c" to "3" and so on, until "z" to "26". It would be a good idea to make a table of this mapping on a piece of paper, because you'll be using it a lot if you follow along.

Despite encryption sounding like something invented in the 20th century (or maybe even more recent than that), encryption originated a long time ago. A really long time ago. Caesar (yeah, that dude from Rome you learned about in high school) needed to send secret messages to his allies that could not read even if the messenger was attacked. His men invented a system called the Caesar Cipher. Its a very simple idea, you convert the message to numbers (and call the resulting set of numbers $P$), and you decide on a shift value, $n$ (which the other party knows of as well), and you do this:

$P_i+n \equiv S_i (mod 26)  ....      (1)$

where $i$ ranges from 1 to the length of $P$ and the set $S$ gives us the numbers for our ciphertext, and, from the numbers we get the actual ciphertext. This might look complicated, but, its quite clear with a little bit of thought. Basically, this, instead of mapping from letters to numbers, we map from letters to other letters. For example, with a shift of 3, we would map "a" to "d", "b" to "e" and so on (and because of the $mod 26$ the z wraps around) all the way to "z" to "c". Similarly, the decryption algorithm would be:

$S_i-n \equiv P_i (mod 26)$ 

This we can get just by manipulating (1) a bit.

Lets do a quick example. Encrypt "attack" with a shift of three. 

Convert to numbers:

$P = \{1,20,20,1,3,11\}$

Apply shift, and mod:

$S = \{1+3,20+3,20+3,1+3,3+3,11+3\}  (mod 26)$
$S = \{4, 23, 23, 4, 6, 14\}$

Result (this is why I told you to make the table): "DWWDFN"

Okay, great, now we know how to use a Caesar Cipher. Of course, it were bulletproof, we wouldn't have people that do PhD's on encryption. The Caesar Cipher is what is known as a substitution cipher, it takes one letter and replaces it with a different one. The problem is, it doesn't change the frequency of the letters,  we can see in the "attack" example that the two T's are now represented as two W's. This is easy to "crack" (going from ciphertext to plaintext without having the shift) by statistical methods. If we have this chart: 


We can try ever shift value (only 26 of them, can't be too hard) and check if the suggested plaintext has a letter distribution matching the one above. The closest match should be the right one!

And, we can also dictionary force it. If we try all the shift values and get a suggested plaintext for each one and check how many dictionary words are formed for each of these, the plaintext with the most matched dictionary words is most likely the one we are looking for.

It may seem a bit cumbersome to go through 26 different shift indices, and it would probably take some time, but, computers can do it before you can say "Caesar was a liar". Literally.

So, how would we solve this problem? 

That's the topic for the next one :)

Code for this article is located here: https://github.com/Poincare/cryptoBlog
Only in Python right now, but, I should have a C version up soon, and I would love ports to other languages, please fork and submit a pull request and I will accept it.

Comments are highly appreciated, and I would love it if you guys gave me any feedback at all (even if it is negative). Redditors, upvotes are very welcome.

I know this one was a bit light, and did not cover a ton of material, but, I aim to remedy that with the next one (I assume starting slow would be better). 

Sunday, October 2, 2011

Not having TCO pretty much kills Javascript

Javascript is a funny language to me. I have things that I really like about it, like higher order functions and being a prototype based language (more on that later). There are also things that I absolutely hate about it like the + and === operators, and how the prototype based things really only half work. The thing that really ruins my experience with the language, however, is tail call optimization.

As many of you now know, Ecmascript 4 does not support tail call optimization. For those that don't know, tail call recursion is where the interpreter takes a recursive function and turns it into one that isn't recursive, so, you aren't limited by stack size. Tail call optimization (TCO) is the heartbeat of functional programming languages and the functional paradigm, without it, functional programming really isn't possible on the current type of processors and memory models we have (I swear I'll be the first one on the planet to get myself a graph reduction processor).

So, basically, by not supporting TCO, Javascript turns from a language that you could (if you tried a little) write functional code in, into a language that  really wants to be functional, but can't, and one that really wants to be prototypal and can't.

Now, about prototypes. If you want to understand why Javascript's prototypes really, really suck, go learn a language like Io. Javascript basically ruins that elegance by trying to make it easier for Java/C++ programmers to understand, which would be okay, if it actually did that, but, as we all know, prototypes in Javascript are probably the most misunderstood concept in the entire language.

I would okay with all of the bad things of Javascript, if only it had TCO. If it had TCO, I would have probably never learned Scheme, and I would have loved to write form validators and login pages all day long with all the recursive goodness, but, someone went and screwed that up, so, we're all stuck with what we have.