- Xref: sparky sci.crypt:3801 alt.security:4571
- Path: sparky!uunet!think.com!barmar
- From: barmar@think.com (Barry Margolin)
- Newsgroups: sci.crypt,alt.security
- Subject: Re: Letter Frequency
- Date: 15 Oct 1992 16:35:03 GMT
- Organization: Thinking Machines Corporation, Cambridge MA, USA
- Lines: 51
- Message-ID: <1bk6jnINNenm@early-bird.think.com>
- References: <1big1qINNrnq@matt.ksu.ksu.edu> <1992Oct15.140918.27296@emr1.emr.ca>
- NNTP-Posting-Host: telecaster.think.com
-
- In article <1992Oct15.140918.27296@emr1.emr.ca> nyelle@ccrs.emr.ca (Norman Yelle) writes:
- >If you have an on-line copy of the dictionary, then you can do the
- >following:
- >
- > grep -i e /usr/dict/words | wc
- >
- >... to find how many words contain the letter 'e'. You can do this for all
- >26 letters. These are the results I got with a dictionary of 25144 words:
-
- That's not a good way to determine the frequency of letters in actual text,
- which is what you generally want for cryptanalysis. It gives every word
- equal weight regardless of how often that word is used, and it counts a
- letter only once per word even when the letter appears several times in
- that word. Here's your list sorted by count:
-
- e 14835
- a 13190
- r 11533
- i 11428
- t 10599
- o 10285
- n 10256
- s 8700
- l 8625
- c 7204
- u 5989
- d 5324
- m 5320
- p 4929
- h 4844
- b 3862
- g 3756
- y 3523
- f 2342
- w 1907
- v 1848
- k 1848
- x 616
- j 427
- z 380
- q 377
-
- Notice how different it is from ETAOIN SHRDLU, the beginning of the usual
- letter frequency list: T is fifth here rather than second, and H is
- fifteenth rather than eighth. That's probably mostly because of the extreme
- frequency of the word "the" in text; it drags T and H several notches
- higher in counts taken from running text.
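
To make the distinction concrete, here is a minimal sketch in Python (not the grep pipeline quoted above) that computes both measures. The function names and the sample sentence are made up for illustration; real use would read a word list and a body of actual text instead.

```python
from collections import Counter
import string

LETTERS = set(string.ascii_lowercase)

def words_containing(words):
    """For each letter, count how many words contain it at least once.
    This is the per-word measure: repeats within a word are ignored."""
    counts = Counter()
    for word in words:
        counts.update(set(word.lower()) & LETTERS)
    return counts

def letter_occurrences(text):
    """Count every occurrence of every letter in running text."""
    return Counter(ch for ch in text.lower() if ch in LETTERS)

def ranking(counts):
    """Letters from most to least frequent; ties broken alphabetically."""
    return "".join(sorted(counts, key=lambda c: (-counts[c], c)))

if __name__ == "__main__":
    # Hypothetical sample text, chosen so that "the" repeats.
    sample = "the cat sat on the mat and the hat"
    # The "dictionary" lists each distinct word exactly once.
    vocabulary = sorted(set(sample.split()))
    print("by word list:   ", ranking(words_containing(vocabulary)))
    print("by running text:", ranking(letter_occurrences(sample)))
```

In this toy sample the word list credits T with 5 words, H with 2 and E with 1, while the running text gives T 7 occurrences, H 4 and E 3, because "the" appears three times in the text but only once in the word list.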
- --
- Barry Margolin
- System Manager, Thinking Machines Corp.
-
- barmar@think.com {uunet,harvard}!think!barmar
-