Saturday, June 30, 2012

The Final Word on the LinkedIn Leak


As you are undoubtedly aware by now, two weeks ago the professional networking site LinkedIn became the victim of a rather unfortunate mishap: they sprang a little leak, and 6.4 million password hashes trickled out onto the internet. And in those two short weeks, hundreds of security experts the world over, of various backgrounds and with hats ranging from white to black, have been feverishly clawing their way through that list in an attempt to crack all 6.4 million passwords. However, few have made more progress in their pursuit than my associate d3ad0ne and me.

Update June 30, 2012: Per Thorsheim and his colleague Tom K. Tørrissen at EVRY have made an infographic based on my data & statistics. You can find it in this blog post.
Update July 2, 2012: Updated table to include GPU speeds for sha512crypt and bcrypt from John the Ripper 1.7.9-jumbo6.
 Update July 3, 2012: Updated 'Pass Phrases' section to include the number of passwords that were at least 16 characters long.
Update May 1, 2013: Corrected GPU name in SHA1 performance table, from "HD6690" to "HD6990." It took me almost a year to catch this mistake!

Surviving on little more than furious passion for many sleepless days, we now have over 90% of the leaked passwords recovered. And while other, presumably less motivated individuals were quick to toss out rather meaningless statistics after cracking as little as one quarter of one percent of the leaked passwords, I am a bit disappointed that I am unable to provide you statistics on 100% of them. However, when leaks such as this occur, a 90% - 95% recovery rate seems to be about par.

The password hash list floating around the internet contains 6,458,020 unique password hashes. However, as many were quick to point out, the majority of the password hashes were mangled: somehow, someway, the first five digits of 3,521,180 of the SHA-1 hashes had been replaced with zeros.
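
To make the mangling concrete, here is a minimal Python sketch of how a candidate password could be checked against a list that mixes intact and mangled hashes. The function names and the two-entry sample "leak" are made up for illustration. The only detail taken from the leak itself is that a mangled hash is an ordinary unsalted SHA-1 hex digest whose first five hex digits have been overwritten with zeros.

```python
import hashlib

def sha1_hex(password: str) -> str:
    """Return the unsalted SHA-1 hex digest of a password."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()

def mangle(digest: str) -> str:
    """Overwrite the first five hex digits with zeros, as seen in the leak."""
    return "00000" + digest[5:]

def matches(password: str, leaked: set) -> bool:
    """True if the password's hash appears in either intact or mangled form."""
    digest = sha1_hex(password)
    return digest in leaked or mangle(digest) in leaked

# Hypothetical two-entry "leak": one intact hash and one mangled hash.
leaked_hashes = {sha1_hex("linkedin"), mangle(sha1_hex("password"))}
```

Because only 20 of the 160 bits are zeroed, the remaining 35 hex digits still identify the password for all practical purposes, so the mangled hashes are crackable just like the intact ones.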

Initial speculation and murmurings around various hacker circles concluded that these mangled hashes must have been the ones already cracked by those who perpetrated the breach. However, such theories do not hold much water when you take into account that 670,781 of the mangled hashes are duplicates of the remaining, non-mangled hashes.

The opinion shared by several esteemed password crackers is that the hash list leaked to the internet was intended to contain only unique hashes, and something along the way didn’t quite go as planned. If the hashes were obtained through SQL injection, there are plenty of things that could have gone wrong. Or, perhaps someone muffed a sed command or similar while attempting to extract the hashes from the database dump. The fact is nobody knows for sure why over half of the hashes are mangled, except perhaps those who did the mangling – and even they may not be sure themselves.


Top 30 Passwords?

There are two important conclusions to reach if you are of the belief that the list was intended to only include unique hashes. First, it is impossible to know exactly how many accounts have been compromised or how many users share the same password. Second, it is therefore impossible to even begin to know what the “Top 30 Passwords” on LinkedIn are, or were. To state that “link” and “1234” are the most used passwords on LinkedIn would be a lie, as from what was provided that information is impossible to know.

What is possible to know, however, is how many passwords share a common base word – the word used to form the password, which usually has additional characters or numbers added to it. For example, we do know that 46,193 of the recovered passwords contain some form of “linkedin” in them. The word “link” was found an additional 12,996 times, while “linked” was found in 7,806 more passwords. The word “love” occurs 21,042 times. And, not surprisingly, some form of “password” occurs in 4,248 passwords, while “pass” occurs an additional 8,008 times.
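
As a rough illustration of what base word counting involves, here is a minimal Python sketch. It is not the extraction script used for the numbers in this post. It assumes a much simpler rule: lowercase the password and strip any leading or trailing characters that are not letters. The sample passwords are invented.

```python
import re
from collections import Counter

def base_word(password: str) -> str:
    """Lowercase the password and strip leading/trailing non-letters."""
    return re.sub(r"^[^a-z]+|[^a-z]+$", "", password.lower())

def count_base_words(passwords):
    """Tally how often each non-empty base word occurs."""
    return Counter(w for w in map(base_word, passwords) if w)

# Invented sample data for illustration only.
sample = ["LinkedIn2012", "linkedin!", "love123", "1love", "12345"]
counts = count_base_words(sample)
```

A rule this simple misses base words buried in the middle of a password and does nothing about character substitutions, which is why real counts take considerably more work.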

Top 15 Base Words Used in LinkedIn Passwords
     1.   linkedin   46,193
     2.   love       21,042
     3.   link       12,996
     4.   anna        9,545
     5.   pass        8,008
     6.   linked      7,806
     7.   jack        7,258
     8.   blue        7,234
     9.   john        6,576
     10.  mark        5,525
     11.  mike        5,424
     12.  chris       5,050
     13.  nick        4,751
     14.  paul        4,499
     15.  password    4,486

The base words “linkedin,” “link,” “linked,” and “blue” more than likely have a connection to LinkedIn, and bring up two interesting potential trends.

For years we’ve been instructing people to use a unique password for each site, but it now appears that far too many people have interpreted that advice as “use the same password for each site, but add the site name to the password to make it unique to that site.” Clever, but here’s the fatal flaw in your plan: if I find out your LinkedIn password is “LINKEDINWillem01,” I’m pretty sure I know what your Facebook and PayPal passwords are, too.

The same goes for people who are using the site’s name or URL itself as their password. By our count, nearly 200 unique passwords contain some form of “linkedin.com” in them. The most noticeable trends were “linked.comNNNN,” where NNNN is typically a year, and NAME@linkedin.com, where NAME is someone’s first or last name. A handful of people also went as far as to use passwords like “linkedinpass” or “thisismylinkedinpassword.” I’ll take one guess at what their bank account password is.
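
One way a site could catch such passwords at registration time is sketched below in Python. The forbidden word list, the substitution table, and the function name are all assumptions made for illustration, and a real deployment would need a far larger list and more careful normalization.

```python
# Hypothetical list of base words a site might refuse to accept.
FORBIDDEN_BASE_WORDS = ["linkedin", "linked", "password"]

# Undo a few common character substitutions before checking.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "@": "a", "$": "s"})

def violates_policy(password: str) -> bool:
    """True if the normalized password contains a forbidden base word."""
    normalized = password.lower().translate(LEET)
    return any(word in normalized for word in FORBIDDEN_BASE_WORDS)
```

With these assumptions, `violates_policy("LINKEDINWillem01")` and `violates_policy("l1nk3d1n2012")` both return `True`, while a password with no forbidden base word passes the check.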

NEW RULE: Sites should not allow the use of their site name or other common base words in users’ passwords.
Twitter is already following this rule.



The other trend we’re beginning to see emerge is people basing their passwords on the primary colors of the website they’re on. For instance, LinkedIn’s primary colors are blue and gray, both of which make the Top 10 list.

Top 10 Colors Used in LinkedIn Passwords
     1.   blue     998
     2.   green    665
     3.   red      445
     4.   orange   417
     5.   purple   398
     6.   pink     304
     7.   black    281
     8.   brown    186
     9.   white    157
     10.  gray     146

The big question here is whether the color blue can somehow be connected to LinkedIn. It could simply be that blue is a wildly popular color, but I can’t help but feel that seeing the color blue on the site made people more inclined to pick that color. Green and red are both very popular colors as well, yet there were 50% more blue-based passwords than green-based ones. Gray also surprises me – do that many people really love the dreary color gray, or were they inspired by what they saw? There were nearly three times more gray-based passwords than lime-based passwords, and that’s even including passwords where it was impossible to tell whether the password referred to the color or the fruit. It’s impossible to draw any hard conclusions, but this is a potential trend worth keeping an eye on in the future.

Password Re-use

One thing we know will always be true is that people tend to select the same passwords. When the social media company RockYou was compromised in December 2009, details for 32 million accounts were leaked to the internet. And out of those 32 million exposed passwords, only 14.3 million were unique.

On a seemingly unrelated note, I have a wordlist that I maintain that contains nothing but real-world passwords from actual security breaches such as RockYou. It is currently almost six gigabytes in size, containing over 500 million unique passwords from sites all over the world.

So, what would happen if I were to run my real-world password wordlist through the LinkedIn hashes? The answer is I would crack 1.4 million of the 6.4 million password hashes in a matter of seconds. 21% of LinkedIn passwords were used as-is on other sites!

If we apply some logical rules to those real-world passwords we pick up another 2 million passwords, meaning an additional 31% of the passwords on LinkedIn are nearly identical to those used on other sites. We were able to recover over 52% of the LinkedIn passwords within the first two hours without really doing any work at all, simply because people everywhere think alike.
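
The two stages described above, a straight wordlist pass followed by a pass with rules, can be sketched in a few lines of Python. This is only an illustration of the idea, not the tooling actually used: the handful of rules, the tiny wordlist, and the target passwords are all invented, and real rule sets contain thousands of transformations.

```python
import hashlib

def sha1_hex(word: str) -> str:
    """Return the unsalted SHA-1 hex digest of a candidate password."""
    return hashlib.sha1(word.encode("utf-8")).hexdigest()

def candidates(word: str):
    """Yield the word as-is, then a few simple rule-based variants."""
    yield word
    yield word.capitalize()
    yield word.upper()
    for suffix in ("1", "123", "2012", "!"):
        yield word + suffix
        yield word.capitalize() + suffix

def crack(hashes, wordlist):
    """Return {hash: password} for every hash matched by the wordlist."""
    remaining = set(hashes)
    found = {}
    for word in wordlist:
        for guess in candidates(word):
            digest = sha1_hex(guess)
            if digest in remaining:
                found[digest] = guess
                remaining.discard(digest)
    return found

# Invented targets: two predictable passwords and one random-looking one.
targets = {sha1_hex("monkey"), sha1_hex("Dragon123"), sha1_hex("Zq8#vLp2")}
found = crack(targets, ["monkey", "dragon"])
```

In this toy run the two predictable passwords fall immediately, one as-is and one through a rule, while the random-looking one survives. Note that because the hashes are unsalted, a single guess is checked against every target at once with one set lookup.
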
NEW RULE: Stop thinking alike! (But you were probably already thinking that.)


Pass Phrases


While the overwhelming majority of the LinkedIn passwords we cracked were between six and eight characters long, nearly 14,000 were at least 16 characters long and over 200 were at least 20 characters long – many of which were phrases consisting of four or more words. Passwords of that length should be fairly secure, so how were we able to crack so many of them?

Blame LinkedIn. The entire point of storing users’ passwords using one-way hash algorithms is to protect the passwords from being discovered, and one of the primary defenses against offline password recovery is the amount of time it takes to calculate each guessed value. If the algorithm used to hash a password can be calculated very quickly, then we can make lots of guesses at what the password might be in a very short amount of time. Conversely, if the algorithm is very slow to calculate, then only a limited number of guesses can be made in a reasonable amount of time. However, against the advice of NIST, OWASP, and other authorities on the subject, LinkedIn was storing passwords using one of the fastest-computing hash algorithms available: SHA-1.

SHA-1 was never designed to be used for password storage; it was primarily designed for message authentication and data validation. As such, SHA-1 is computationally very inexpensive and hashes can be generated very quickly – which is what you want when dealing with things like SSL or IPsec, but definitely not desirable when trying to protect users’ passwords.

Another aggravating factor is that LinkedIn did not salt users’ passwords. Salting is when you add a unique, random string of characters to a user’s password before calculating the hash, so that even if two users happen to have the same password, they will have unique password hashes. Salting passwords not only defeats Rainbow Tables (large precomputed tables used to look up the password behind a hash, for a particular algorithm and up to a certain password length), but also reduces the number of guesses we are able to make per second against the list as a whole, since each password guess has to be hashed with each unique salt.
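
A minimal Python sketch of salting follows. It deliberately keeps SHA-1 as the hash so that the only thing being demonstrated is the salt itself; a single round of salted SHA-1 is still far too fast for password storage. The function names and the 16-byte salt length are assumptions chosen for the example.

```python
import hashlib
import secrets

def salted_sha1(password: str, salt: bytes) -> str:
    """Hash the salt followed by the password (illustration only)."""
    return hashlib.sha1(salt + password.encode("utf-8")).hexdigest()

def new_record(password: str):
    """Generate a fresh random salt and return (salt, hash) for storage."""
    salt = secrets.token_bytes(16)
    return salt, salted_sha1(password, salt)

def verify(password: str, salt: bytes, stored_hash: str) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    return secrets.compare_digest(salted_sha1(password, salt), stored_hash)

# Two users who picked the same password end up with different hashes.
salt_a, hash_a = new_record("linkedin")
salt_b, hash_b = new_record("linkedin")
```

Since every record has its own salt, an attacker has to hash each guess once per salt instead of once for the entire list.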

The last aggravating factor is that LinkedIn passwords were hashed using only one iteration of the SHA-1 algorithm. Modern password hashing algorithms typically employ thousands of iterations to make them more computationally expensive. For example, the Unix SHA512-based crypt scheme, aka sha512crypt, uses 5,000 iterations of the SHA-512 hash algorithm by default, and can be configured to use as many as one billion iterations so that it scales as computing power increases. The bcrypt algorithm -- which is vastly more computationally expensive than sha512crypt -- typically only uses 256 to 1024 iterations of the Blowfish keying algorithm by default, simply because each iteration is so expensive that it doesn't need to use more than that to be effective.
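
The effect of an iteration count can be seen with a short Python sketch. Note the substitution: Python's standard library does not ship bcrypt, so this example uses PBKDF2-HMAC-SHA512 via `hashlib.pbkdf2_hmac` as a stand-in for an iterated scheme. It is not sha512crypt and not bcrypt. The salt value and helper names are assumptions for illustration.

```python
import hashlib
import time

def stretch(password: str, salt: bytes, iterations: int) -> bytes:
    """Derive a 64-byte key with PBKDF2-HMAC-SHA512 and the given iteration count."""
    return hashlib.pbkdf2_hmac("sha512", password.encode("utf-8"), salt, iterations)

def seconds_per_guess(iterations: int, trials: int = 3) -> float:
    """Measure the average wall-clock cost of a single guess on this machine."""
    salt = b"fixed-demo-salt"  # demo only; real salts are random per user
    start = time.perf_counter()
    for _ in range(trials):
        stretch("guess", salt, iterations)
    return (time.perf_counter() - start) / trials

if __name__ == "__main__":
    for n in (1, 5_000, 100_000):
        print(f"{n:>7} iterations: {seconds_per_guess(n):.6f} s per guess")
```

The cost per guess grows roughly in proportion to the iteration count, which is the knob that lets a scheme keep pace with faster hardware.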

The vast majority of the LinkedIn passwords my associates and I recovered were cracked using a program called oclHashcat, which enables us to use graphics cards to crack passwords (modern graphics cards being much faster at cracking algorithms like SHA-1 than ordinary computer processors). Using four AMD Radeon HD6990 graphics cards, I am able to make about 15.5 billion guesses per second using the SHA-1 algorithm.

If that sounds like a lot, that’s because it is a lot. Even on my Intel Core i7 990X processor, I can crack SHA-1 at a rate of 98 million guesses per second using a program called John the Ripper, which is still very fast. Compare that to an algorithm like bcrypt, which I can crack at a rate of almost 5,000 guesses per second on the same processor for hashes with a cost factor of 5 (32 iterations). Graphics cards don't help much with bcrypt either, since its design makes it very GPU-unfriendly.

Speed of SHA-1 vs. Modern Password Hashing Algorithms

Algorithm     Iterations   Software                       Hardware                Guesses Per Second
SHA-1         1            John the Ripper 1.7.9-jumbo6   Intel Core i7 990X      98,000,000
SHA-1         1            oclHashcat-plus 0.09           4x AMD Radeon HD 6990   15,500,000,000
sha512crypt   5,000        John the Ripper 1.7.9-jumbo6   Intel Core i7 990X      1,800
sha512crypt   5,000        John the Ripper 1.7.9-jumbo6   ATI Radeon HD 5870      2,592
sha512crypt   5,000        John the Ripper 1.7.9-jumbo6   Nvidia GTX 580          11,405
bcrypt        32           John the Ripper 1.7.9-jumbo6   Intel Core i7 990X      4,960
bcrypt        32           John the Ripper 1.7.9-jumbo6   ATI Radeon HD 5870      1,745
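
To put those rates in perspective, here is a small worked calculation in Python. The guess rates are copied from the table above. The target keyspace, every password of exactly eight lowercase letters, is an assumption picked purely for illustration, and the helper names are made up.

```python
# Guesses per second, copied from the table above.
RATES = {
    "SHA-1, 4x Radeon HD 6990": 15_500_000_000,
    "SHA-1, Core i7 990X": 98_000_000,
    "sha512crypt (5,000 iterations), Core i7 990X": 1_800,
    "bcrypt (32 iterations), Core i7 990X": 4_960,
}

def seconds_to_exhaust(alphabet_size: int, length: int, rate: float) -> float:
    """Worst-case seconds to try every password of exactly `length` characters."""
    return alphabet_size ** length / rate

def human(seconds: float) -> str:
    """Format a duration as seconds plus the equivalent number of days."""
    return f"{seconds:,.0f} s ({seconds / 86_400:,.1f} days)"

if __name__ == "__main__":
    # Assumed target: all 26**8 eight-letter lowercase passwords.
    for name, rate in RATES.items():
        print(f"{name}: {human(seconds_to_exhaust(26, 8, rate))}")
```

Under these assumptions the four-GPU rig exhausts the whole space in roughly 13 seconds, while the same search against bcrypt on the CPU takes well over a year.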

The big question you should be asking right now is whether LinkedIn’s developers consciously decided to use such a weak password storage scheme, or simply never considered changing the default password storage scheme of some solution that they purchased.

Either way, the responsibility falls squarely on LinkedIn’s shoulders. A solid Systems Development Lifecycle (SDLC) risk management policy would most certainly include application security reviews, where issues such as these would be unearthed by security experts and addressed by developers. So whether LinkedIn has no security leadership, their SDLC risk management program is broken or non-existent, or they chose to accept the risk associated with using a weak password storage scheme, they were doing something wrong.

And it was because LinkedIn was doing it wrong that we were able to crack as many passwords as we did. Had LinkedIn stored their users’ passwords using a computationally expensive hashing algorithm like bcrypt, we would have had to be very selective about what kinds of attacks we ran and how long they would take to complete. But since they used a single iteration of unsalted SHA-1, we were virtually unlimited in the types of attacks we could run. We were able to throw gigabytes and gigabytes of words at the hashes, run each word through permutation filters and rules engines, and even run complex combinations of attacks without having to worry about how long each attack would take to complete. And it’s because even the most complex attacks we launched finished in a matter of hours that we were able to recover as many complex passwords as we did.
NEW RULE: Sites must only store user passwords using hashing algorithms specifically intended for storing passwords, such as bcrypt. A well-defined SDLC risk management program would benefit everyone as well.
---
 HUGE thanks to Per for backing me on this and working with me to create this writeup, @d3ad0ne_ for teaming up with me to crack the last bit of hashes, and to Tom K. Tørrissen for doing the great infographic!

19 comments:

  1. "Initial speculation and murmurings around various hacker circles concluded that these mangled hashes must have been the ones already cracked by those who perpetrated the breach. However, such theories do not hold much water when taken into account that 670,781 of the mangled hashes are duplicates of the remaining, non-mangled hashes."
    These speculations rely on the fact that the ^00000 hashes look far easier to crack.
    I have found only 2/3 of the passwords, and 85% of the 00000. Maybe it's pure speculation, but as you possess more results, maybe you can verify an "easy" thing: just count the number of passwords composed with special characters in the ^00000 hashes and in the non-^00000 hashes. From what I've seen the difference is noticeable, though I don't have as many passwords as you to make this guess.
    If the hypothesis that the 00000 passwords are those that have already been cracked is correct, we can try to analyse what tools the initial hacker used to crack them (rainbow tables? user/login information?), and maybe we can produce a kind of profile of him (and maybe more, but then it won't be the final word on the LinkedIn leak ;-) )
    My 2 cents.

  2. Fascinating read. Are you still working on the 10% you have not cracked or are you sharing them for other analysis?

    Also, that 6 GB password file you have - public anywhere? ;)

  3. Francois: i do have more masked hashes cracked than unmasked, but also realize that the masked hashes were the majority of the hashes, and the majority of the hashes -- both masked and unmasked -- were easy to crack. but, i still have a fair amount of masked hashes left to crack (about 200k.)

    i will also add that i have a little bit of inside information on the subject; although i won't dive into particulars, i will state that a source whom i trust that was close to the incident informed me that dwdm (the one who originally posted both the linkedin and eharmony hashes on insidepro's forum) was paid to crack those hashes, and decided to crowdsource them instead of cracking them himself. so that throws another wrench in the 'masked hashes were already cracked' theory.

    Anonymous: yeah i'm still toying with the remaining hashes. been brainstorming with atom (lead hashcat dev) and trying out some new ideas.

    as far as that 6GB wordlist goes, some parts of it are public, sure. the rest is pretty much comprised of almost every password i've ever personally cracked from various sites. skullsecurity has some good public lists to get you started.

  4. C'mon guys, the lesson is: let the site choose the password for the user!
    It's trivial and even easier to implement

  5. Gianluca, I wouldn't be very happy if every site/service I used told me I had to use a certain password, and that I couldn't change it. I think SAML or OAuth are more elegant long-term solutions. Even better if the identity provider offers two-factor auth.

  6. @blazer2x on twitter just asked the following question: "I just ran our list through color analysis, seems quite different to your published one. Is yours case sensitive?"

    No, it's not case sensitive. However, I'm curious as to how you ran your list through color analysis, because counting colors is a really difficult task.

    First, there are lots of words that contain the name of a color that have little to do with the color at all, at least from a psychological standpoint: words like whitesox, redwings, bluejay, greenacres, purplehaze, greyhound, etc.

    Second, there are far too many words that contain the letters r-e-d in them: fired, hired, adored, fred, bored, etc etc.

    So you see, you can't just do e.g. "grep -ci red plains" because that would be wildly inaccurate.

    So then how did we get an accurate count? First I ran the plains through my "baseword extraction algorithm" (sounds sexy, but it's really just a long sed script), and then I counted the number of times each color appears as a baseword /by itself/, e.g. "grep -c '^blue$' basewords"

    I then did some manual analysis -- which made me want to pull my hair out -- to see if there were any that I missed, and added a few stragglers to the count.

    So that's how I counted the colors, which I'm fairly confident is much more accurate than some other methods out there. I'd be interested to hear if there's a flaw in my method though!

  7. I would make public the remaining 10% because the passwords in this list would be statistically important compared to the hashes cracked using your existing word lists.

  8. @Jeremi Gosney

    Thanks for the clear explanation. I was purely counting all instances of colors which obviously is inaccurate.

    It makes sense that you are only counting base words. Though I have a question: how are words such as 'fireryred' or, say, color+name 'redjim' treated?

    Suggestion: You may also want to count gray with its alternate spelling 'grey'. US/UK spelling differences I believe.

  9. @blazer2x, words such as 'fireyred' and 'redjim' were both omitted in the method i used. and yes, i did count both gray and grey in the count for gray :)

  10. Is it common for an attacker to only be able to do SQL injections?

    Because there is a simple solution to SQL injection only attacks. Hash then encrypt the password. You can then just store the key in a file that only the web server can read. Then even if the database gets dumped the attacker would need to have access to the file system or be able to dump code to find the key.

    If it is of interest to anyone here's an implementation of it in PHP: http://www.tobtu.com/encryptbcrypt.php

  11. Hi Steve! Nice to see you are secretly watching us. :-)

    I don't know if you are on Twitter as well, but the discussion has started. Initial problem; if an attacker uses SQLi through your application that automatically decrypts the bcrypt data upon extraction from the DB, you really don't add much security to the mix, if any at all. Please see @skradel @pdp11hacker @klingsen @troyhunt @chrismckee @securityninja :-)

  12. If you're already using bcrypt, scrypt, or PBKDF2 and teaching your developers to prevent SQL injection, then I don't see the point of using a local parameter (key, salt) on the web server as a further hedge against SQL injection. If your tolerance for cracked passwords is that low, then you should consider offloading authentication to a separate server and/or a hardware security module (HSM) so that the local parameter is never accessible to the web server. This will provide additional protection against offline password cracking regardless of the type of access the attacker is able to obtain to the web server.

  13. Steve, "only SQL injection" is kind of an odd thing to say. You can leverage SQL injection to do much more than just dump the database. In the best (worst?) scenario, you can use it to execute arbitrary code or get a shell. And if you can execute arbitrary code as the httpd user then you have the exact same privileges as the httpd, making reversible encryption utterly worthless.

    I also concur with Steven Alexander's comment, which goes along with the SDLC risk management process I mentioned toward the end of the article.

  14. Wow never knew that was possible:
    http://security.stackexchange.com/questions/6919/levraging-a-shell-from-sql-injection

    So it's more or less pointless to encrypt.

  15. This reminds me of XKCD's password strength comic: http://xkcd.com/936/ and its explanation: http://ask.metafilter.com/193052/Oh-Randall-you-do-confound-me-so#2779020

    Lots of people wrote that he was right. For example, Steve Gibson from the Security Now podcast did a lot of work in this arena and found that this password "D0g....................." is harder to break than this password "PrXyc.N(n4k77#L!eVdAfp9". http://www.explainxkcd.com/2011/08/10/password-strength/

    Others differ, like in this interesting thread, which mentions http://rumkin.com/tools/password/passchk.php but fails to settle whether XKCD is right or not: How accurate is this XKCD comic from August 10, 2011? http://security.stackexchange.com/questions/6095/xkcd-936-short-complex-password-or-long-dictionary-passphrase

    What do you think about this entropy stuff? Is it better to create non-existent terms, or a long password based on uncommon sentences/words?

  16. Basically - passwords are no good anymore. If your provider doesn't have some kind of two-factor protection available, hassle them to get it!

  17. A handful of people also went as far as to use passwords like “linkedinpass” or “thisismylinkedinpassword.” I’ll take one guess at what their bank account password is.

    I use that kind of passwords for sites like linkedin and others I don't consider a very big problem if it's hacked. For forums I often use a simple password of just a couple letters, linkedin a bit higher security, facebook a level up, bank and financial stuff a level up from that again. Using difficult passwords is troublesome, so I only use it when I have to, but when it's important, I do use it.

  18. NEW RULE: Sites should not allow the use of their site name or other common base words in user’s passwords.

    If I'm forced to use difficult passwords on forum sites like twitter, I end up registering a new account every time I want to comment something.

  19. Think of a song/poem/quote, take the first letters of the words and replace the characters in h@x0r 5p3@k... no online strength checker will ever tell you the true "crackability" of your password.

    Thing is you need to change the passwords every three months or less so when you have logins for 5+ sites/services you're pretty much screwed... so maybe using a digital signature of sorts will be more secure as in 4096 string stored on a smart card ... but what happens when you lose it ?

    I guess what I'm trying to say is keep it reasonable. If anyone put their mind to cracking your password "GOD" will not protect you ... or any other password for that matter ;)

