The variance of common words of English: a BNC-based resource
=============================================================
Documentation for ftp.itri.bton.ac.uk/pub/bnc/variances
Adam Kilgarriff
15 March 1996
RATIONALE
It has long been noted that corpus frequencies, taken alone, give a
very limited picture of a word's distribution in a corpus. As well as
varying in raw frequency, words vary in the extent to which they are
evenly spread across the documents in the corpus. This 'burstiness'
can be measured in a variety of ways (Church and Gale, "Poisson
Mixtures", JNLE 1(2), 1996). One straightforward possibility is to
take a large number of documents, all of the same length; count the
frequency of a word in each of these documents; and calculate the
(mean and) variance of this frequency.
The file presents the results of such an exercise. It is potentially
of interest for various statistical approaches to text processing
(e.g. author identification and information retrieval) as well as
for linguistic studies of how much semantic content different English
words have.
METHOD
I took the first 5,000 words of all documents (=files) longer than
5,000 words in the written part of the BNC. There were 2018 of these,
so I was working from a subcorpus of slightly over 10M words. (I used
written-only on the premise that the spoken material would be too
different to usefully treat as part of the same population - of
course, one might say this about all sorts of subcorpora, but never
mind.) Then I produced a frequency list for each of these (truncated)
documents. Then, taking the 8189 word-pos pairs occurring 100 times or
more in the sample, I produced a 2018x8189 table giving the frequency
of each word in each document, and calculated, for each word, the mean
and variance.
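The procedure above can be sketched roughly as follows. This is an
illustration only, not the code actually used: the function name
dispersion_table and its parameters are hypothetical, documents are
assumed to arrive as lists of (word, pos) tokens, and the variance is
assumed to be the population variance (dividing by the number of
documents), which this file does not state explicitly.

```python
from collections import Counter


def dispersion_table(documents, window=5000, min_total=100):
    """Per-item dispersion figures over equal-length document samples.

    documents: iterable of token lists; a token might be a (word, pos) tuple.
    Returns {item: (total, doc_freq, mean, variance, variance / mean)}.
    """
    # Keep only documents longer than the window, truncated to the window.
    samples = [doc[:window] for doc in documents if len(doc) > window]
    n_docs = len(samples)

    # One frequency list per truncated document.
    counts = [Counter(sample) for sample in samples]

    # Total frequency of each item across the whole sample.
    totals = Counter()
    for c in counts:
        totals.update(c)

    table = {}
    for item, total in totals.items():
        if total < min_total:
            continue
        # "Inclusive" figures: zeros count, so always divide by n_docs.
        mean = total / n_docs
        sum_sq = sum(c[item] ** 2 for c in counts)
        variance = sum_sq / n_docs - mean ** 2  # population variance (assumed)
        doc_freq = sum(1 for c in counts if item in c)
        table[item] = (total, doc_freq, mean, variance, variance / mean)
    return table
```

With window=5000 and min_total=100 this mirrors the thresholds
described above; smaller values are convenient for trying it out on
toy data.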
There were two ways to calculate mean and variance: including the zeros
(i.e. always dividing by 2018) or excluding them (dividing by the number
of documents the word occurred in). For most purposes it is the former
that is of interest, so this is what I present. The "exclusive" figures
may readily be reconstructed.
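One possible reconstruction of the "exclusive" figures from the total
frequency, the document count and the inclusive variance is sketched
below. The function name exclusive_stats is hypothetical, and the
sketch assumes the published variance is a population variance
(dividing by 2018); if it is in fact a sample variance (dividing by
2017), the recovery of the sum of squares would need adjusting.

```python
def exclusive_stats(total, doc_freq, variance, n_docs=2018):
    """Recover mean and variance over only the documents containing the item.

    total:    total frequency of the item in the sample
    doc_freq: number of documents the item occurs in
    variance: inclusive variance (assumed to be a population variance)
    """
    mean_incl = total / n_docs
    # Zeros add nothing to the sum of squares, so it can be recovered
    # from the inclusive variance: var = sum_sq / N - mean^2.
    sum_sq = n_docs * (variance + mean_incl ** 2)

    # Re-divide by the number of documents the item actually occurs in.
    mean_excl = total / doc_freq
    var_excl = sum_sq / doc_freq - mean_excl ** 2
    return mean_excl, var_excl
```

For example, counts of 0, 0, 2 and 4 over four documents have an
inclusive mean of 1.5 and variance of 2.75; over the two documents
where the item occurs, the mean is 3 and the variance is 1.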
FILE FORMAT
Columns are
(1) Word      ) Using BNC definitions of 'word' and tags
(2) POS-tag   ) - see README for details
(3) Total freq (in 10M corpus)
(4) (Truncated) documents that word-pos pair occurs in (out of 2018)
(5) Mean (= Total freq./2018)
(6) Variance
(7) Variance/mean
The last is useful because, for distributions like the normal,
Poisson and binomial, variance increases with mean, so, to make the
variance figures comparable for words of different base frequency, it
is necessary to normalise by the mean. This is the figure that shows
that, e.g., pronouns have very high variability, and prepositions, low
(cf. Kucera and Francis 1982).
Words are presented in frequency order. The file is 0.4MB
(uncompressed) and 0.1MB (compressed): both forms are available.
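A minimal sketch of reading one line of the file into named fields
follows. It assumes the seven columns are separated by whitespace and
that neither the word nor the tag contains a space, which is not
stated here; the names Row and parse_line are hypothetical.

```python
from typing import NamedTuple


class Row(NamedTuple):
    word: str             # column 1
    pos: str              # column 2
    total: int            # column 3: total frequency in the sample
    doc_freq: int         # column 4: documents the pair occurs in
    mean: float           # column 5
    variance: float       # column 6
    var_over_mean: float  # column 7


def parse_line(line):
    """Split one whitespace-separated line into a Row (format assumed)."""
    word, pos, total, doc_freq, mean, variance, ratio = line.split()
    return Row(word, pos, int(total), int(doc_freq),
               float(mean), float(variance), float(ratio))
```

A line with more or fewer than seven fields raises ValueError, which
is a cheap way of finding out whether the whitespace assumption holds
for the real file.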
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Research Fellow                              tel:   (44) 1273 642919
Information Technology Research Institute           (44) 1273 642900
University of Brighton                       fax:   (44) 1273 642908
Lewes Road
Brighton BN2 4AT                 email: Adam.Kilgarriff@itri.bton.ac.uk
UK                          http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%