Lemmatised BNC frequency list available
=======================================
Adam Kilgarriff
University of Brighton
30th May 1996
Following various requests, particularly from workers in EFL, I have
prepared a lemmatised frequency list from the BNC. This is a
single list giving word frequencies for the 6,318 words with more than 800
occurrences in the whole 100M-word BNC. The definition of a 'word'
approximates to a headword in an EFL dictionary such as Longman's
Dictionary of Contemporary English: so, eg, nominal and verbal "help"
are listed separately, and the count for verbal "help" is the sum of
counts for verbal 'help', 'helps', 'helping', 'helped'.
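As an illustration of this counting rule, the following minimal sketch sums
the counts of inflected forms into one count per lemma and word class. The
form-to-lemma table, the function name lemma_counts and all the counts are
made-up placeholders for illustration; they are not BNC figures and this is
not the code used to build the list.

```python
from collections import defaultdict

# Hypothetical (word form, word class) -> lemma table; illustrative only.
FORM_TO_LEMMA = {
    ("help", "v"): "help",
    ("helps", "v"): "help",
    ("helping", "v"): "help",
    ("helped", "v"): "help",
    ("help", "n"): "help",
}


def lemma_counts(form_counts):
    """Sum per-form counts into per-(lemma, word class) counts."""
    totals = defaultdict(int)
    for (form, word_class), count in form_counts.items():
        # Forms missing from the table are treated as their own lemma.
        lemma = FORM_TO_LEMMA.get((form, word_class), form)
        totals[(lemma, word_class)] += count
    return dict(totals)


# Placeholder counts, NOT real BNC frequencies.
example = {
    ("help", "v"): 40,
    ("helps", "v"): 10,
    ("helping", "v"): 20,
    ("helped", "v"): 30,
    ("help", "n"): 25,
}
totals = lemma_counts(example)
```

Keying on the (lemma, word class) pair keeps nominal and verbal "help"
apart, as in the list itself.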
The list is available over the net in directory
ftp://ftp.itri.bton.ac.uk/pub/bnc/
The lemmatised list is called 'lemma' and is available in four forms:
ordered alphabetically or by frequency, and compressed (using gzip) or
uncompressed, so the four files are:
lemma.al (124 KB)
lemma.al.gz (55 KB)
lemma.num (124 KB)
lemma.num.gz (55 KB)
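Once a copy of one of these files has been fetched, the compressed and
uncompressed forms can be read in the same way. The sketch below assumes a
local copy and assumes the text can be decoded as Latin-1; the helper name
open_lemma_file is hypothetical.

```python
import gzip


def open_lemma_file(path):
    """Open a lemma list as text, whether gzipped (.gz) or uncompressed."""
    # Assumption: the files are plain text that decodes safely as Latin-1.
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="latin-1")
    return open(path, "r", encoding="latin-1")
```

With this helper, iterating over open_lemma_file("lemma.num.gz") or
open_lemma_file("lemma.num") would yield the same lines.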
The format for the list is:
sort-order, frequency, word, word-class
and a sample from the top of the alphabetically-ordered list is:
5 2186369 a det
2107 4249 abandon v
5204 1110 abbey n
966 10468 ability n
321 30454 able a
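A minimal sketch of reading this format, assuming the four fields are
separated by whitespace (which fits the sample, since only words without
spaces are listed). The names Entry and parse_line are hypothetical, and
reading sort-order as the position in the frequency ordering is an
inference from the sample rather than something stated here.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    sort_order: int   # assumed: position in the frequency-ordered list
    frequency: int    # occurrences in the whole BNC
    word: str
    word_class: str


def parse_line(line):
    """Parse one 'sort-order frequency word word-class' line."""
    sort_order, frequency, word, word_class = line.split()
    return Entry(int(sort_order), int(frequency), word, word_class)


# The five sample lines quoted above.
SAMPLE = """\
5 2186369 a det
2107 4249 abandon v
5204 1110 abbey n
966 10468 ability n
321 30454 able a
"""

entries = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]

# Re-sorting on the first field turns the alphabetical sample into
# frequency order.
by_frequency = sorted(entries, key=lambda e: e.sort_order)
```

For the sample, sorting on the first field also puts the frequencies in
descending order, which is consistent with the assumption above.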
The list-creation process replicated that used at Longman for marking
dictionary frequencies in LDOCE 3rd edition, a process described in
Kilgarriff, A. Putting Frequencies in the Dictionary.
International Journal of Lexicography (to appear). Available
electronically (gzipped postscript) as:
ftp://ftp.itri.bton.ac.uk/pub/bnc/ijl.ps.gz
Numbers, names, and items that would usually be capitalised are
excluded. Only simple words (eg containing no spaces) were
considered. The following set of word classes is used:
    conj (conjunction)        34 items
    adv (adverb)             427
    v (verb)                1281
    det (determiner)          47
    pron (pronoun)            46
    interjection              13
    a (adjective)           1124
    n (noun)                3262
    prep (preposition)        71
    modal                     12
    infinitive-marker          1
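As a quick consistency check, the per-class counts above can be summed and
compared with the 6,318 entries mentioned at the start. The sketch below
does that and also shows how the same tally could be recomputed from the
word-class field of a downloaded list; the names CLASS_COUNTS and
tally_classes are hypothetical.

```python
from collections import Counter

# Item counts per word class, copied from the table above.
CLASS_COUNTS = {
    "conj": 34,
    "adv": 427,
    "v": 1281,
    "det": 47,
    "pron": 46,
    "interjection": 13,
    "a": 1124,
    "n": 3262,
    "prep": 71,
    "modal": 12,
    "infinitive-marker": 1,
}

total = sum(CLASS_COUNTS.values())


def tally_classes(word_classes):
    """Count how many list entries carry each word-class label."""
    return Counter(word_classes)
```

The counts in the table add up to 6,318, matching the stated size of the
list.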
A word like "right" has four list entries, for adjective, adverb,
interjection and noun. (Just ten words have more than three list
entries.)
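Words with several entries can be found by grouping entries on the word
field. The sketch below uses only (word, word-class) pairs taken from this
note; the function name classes_by_word is hypothetical, and on the full
list the same grouping would be applied to every entry.

```python
from collections import defaultdict


def classes_by_word(pairs):
    """Group (word, word_class) pairs into word -> list of word classes."""
    grouped = defaultdict(list)
    for word, word_class in pairs:
        grouped[word].append(word_class)
    return dict(grouped)


# Pairs mentioned in this note: the four entries for "right" plus two
# single-entry words from the sample.
PAIRS = [
    ("right", "a"),
    ("right", "adv"),
    ("right", "interjection"),
    ("right", "n"),
    ("abbey", "n"),
    ("able", "a"),
]

grouped = classes_by_word(PAIRS)

# Words with more than three list entries.
many = sorted(word for word, classes in grouped.items() if len(classes) > 3)
```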
Unlike the Longman list, only the BNC was used (so the lists only
reflect British, not American, frequencies); spoken and written
frequencies are not separated; spelling variants are not counted as a
single word; manual checking was less extensive.
The raw lists, from which the lemmatised list was generated (and which
are, consequently, a less theory-dependent form of data), are also
available from the same FTP site: see
ftp://ftp.itri.bton.ac.uk/pub/bnc/README
This file is available as ftp://ftp.itri.bton.ac.uk/pub/bnc/lemma.doc
The work was undertaken under EPSRC grant 'SEAL'.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Research Fellow                              tel:   (44) 1273 642919
Information Technology Research Institute           (44) 1273 642900
University of Brighton                       fax:   (44) 1273 642908
Lewes Road
Brighton BN2 4AT                             email: Adam.Kilgarriff@itri.bton.ac.uk
UK                                           http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%