home *** CD-ROM | disk | FTP | other *** search
- Path: sparky!uunet!sun-barr!ames!agate!agate!usenet
- From: stolfi@src.dec.com (Jorge Stolfi)
- Newsgroups: comp.archives
- Subject: [sci.lang] Natural language wordlists available
- Followup-To: comp.editors
- Date: 4 Sep 1992 07:52:41 GMT
- Organization: DEC Systems Research Center
- Lines: 209
- Approved: adam@soda.berkeley.edu
- Distribution: world
- Message-ID: <1874k9INN1vl@agate.berkeley.edu>
- References: <1992Aug24.212359.6906@src.dec.com>
- NNTP-Posting-Host: soda.berkeley.edu
- Summary: A collection of inflected wordlists for several natural
- X-Original-Newsgroups: sci.lang,comp.text,comp.editors
- X-Original-Date: 24 Aug 92 21:23:59 GMT
-
- Archive-name: auto/sci.lang/Natural-language-wordlists-available
-
-
- A few months ago I collected half a dozen natural language wordlists by
- anonymous ftp from various places around the world. I spent a couple
- of weeks regularizing the spelling conventions, merging the lists into
- one per language, and removing various junk. The result is now
- available through anonymous FTP from
-
- site: gatekeeper.dec.com
- directory: pub/misc/stolfi-wordlists
-
- The package's README file (below) describes the wordlists and how to
- unpack them. Each wordlist has its separate README file, which gives
- more details on the source of the words, spelling conventions, etc.
-
- Note: To save space and transmission time, they were crunched with a
- non-standard crunching algorithm specialized for sorted word lists, and
- then piped though the Unix utilities "tar" "compress". The package
- includes an un-crunching program (in Unix C).
-
-
- PACKAGE: DEC SRC Collection of Public Domain Wordlists
- VERSION: DEC-SRC-92-04-05
-
- EDITOR
-
- Jorge Stolfi
-
- (Until 92-Aug-24:)
- DEC Systems Research Center
- 130 Lytton Avenue, Palo Alto CA 94301
- Phone: [USA] (415) 853-2226
-
- (After 92-Aug-24:)
- DCC - Universidade Estadual de Campinas
- Caixa Postal 6065
- 13081 Campinas, SP - Brazil
- Phone: +55 (192) 39 8442
- E-Mail: <stolfi@dcc.unicamp.ansp.br>
-
- DESCRIPTION
-
- This package contains wordlists for several natural languages, which
- may be useful for linguistic research and text-processing
- applications such as spelling checkers. They were compiled from
- several publicly available files that I obtained by anonymous FTP
- from various sites around the world.
-
- All the credit for this package should go to the authors of the
- original lists, who did all the actual work and made the results
- available for public use. In particular, I wish to thank
-
- Anders Ellefsrud
- Andy Tannenbaum
- Arjan de Vet
- Barry Brachman
- Dan Klein
- David Vincenzetti
- Edward Vielmetti
- Geoff Kuenning
- H. Morrow Long
- Hans Bouw
- Henk Smit
- Martien Kuunders
- Neal Dalton
- Paul Stravers
- Stefan Kutsch
- Walt Buehring
- Werner Icking
-
- My role here is only one of editor and collector: basically, I
- feteched, compared, and merged the files, trying to uniformize the
- spelling conventions for each language (such as the encoding of
- accents) and removing obvious typos, non-words, and "foreign" words.
- My efforts were necessarily limited by my ignorance of most of these
- languages, and by the limited time I could spen in the cleanup (a
- couple of days for each list).
-
- (NON-)COPYRIGHT STATUS
-
- To the best of my knowledge, all the files I used to build these
- wordlists were available for public distribution and use, at least
- for non-commercial purposes. I have confirmed this assumption with
- the authors of the lists, whenever they were known.
-
- Therefore, it is safe to assume that the wordlists in this package
- can also be freely copied, distributed, modified, and used for
- personal, educational, and research purposes. (Use of these files in
- commercial products may require written permission from DEC and/or
- the authors of the original lists.)
-
- Whenever you distribute any of these wordlists, please distribute
- also the accompanying README file. If you distribute a modified
- copy of one of these wordlists, please include the original README
- file with a note explaining your modifications. Your users will
- surely appreciate that.
-
- (NO-)WARRANTY DISCLAIMER
-
- These files, like the original wordlists on which they are based,
- are still very incomplete, uneven, and inconsitent, and probably
- contain many errors. They are offered "as is" without any warranty
- of correctness or fitness for any particular purpose. Neither I nor
- my employer can be held responsible for any losses or damages that
- may result from their use.
-
- FILES
-
- This package contains the following sub-directories and files:
-
- README
-
- This file.
-
- unpack-words
-
- A shell script that unpacks the compressed archives (see below).
-
- expanddict.c
-
- A C program used by "unpack-words" to expand the wordlists
- after extracting them from the archive file.
-
- dutch/
-
- FILE CONTENTS WORDS BYTES
- --------------------- --------------------- -------- ---------
- dutch.words words & names 189249 2137557
- dutch.trash rejected material 1910 14802
- dutch.maybe unclassified 43059 579746
-
- english/
-
- FILE CONTENTS WORDS BYTES
- --------------------- --------------------- -------- ---------
- english.words plain words 104206 1055781
- english.names proper names 6186 54253
- org.names organizations 154 1229
- computer.names computer orgs & prods 88 676
- misc.names "foreign" names 2020 16470
- english.abbrs abbreviations 671 3086
- english.trash rejected material 3123 25210
- english.maybe unclassified 5906 57754
-
- german/
-
- FILE CONTENTS WORDS BYTES
- --------------------- --------------------- -------- ---------
- german.words words & names 174537 2260942
- german.trash rejected material 2187 16240
-
- italian/
-
- FILE CONTENTS WORDS BYTES
- --------------------- --------------------- -------- ---------
- italian.words plain words 61183 573117
- italian.trash rejected material 1867 14816
-
- norwegian/
-
- FILE CONTENTS WORDS BYTES
- --------------------- --------------------- -------- ---------
- norwegian.words plain words 61839 598547
- norwegian.trash rejected material 5 30
-
- swedish/
-
- FILE CONTENTS WORDS BYTES
- --------------------- --------------------- -------- ---------
- swedish.words words & names 14944 126412
- swedish.trash rejected material 8744 79217
-
- In general,
-
- foo/foo.words is the main wordlist for language foo.
-
- foo/foo.trash is a collection of "words" from the original
- wordlists that I believe are either invalid
- or do not belong in foo.words.
-
- For more details, consult the README files in each sub-directory.
-
- COMPRESSED ARCHIVES
-
- The package is available for public FTP as a set of compressed "tar"
- files, one per language:
-
- -rw-r--r-- 500455 Aug 2 03:17 dutch.tar.Z
- -rw-r--r-- 246631 Aug 2 03:13 english.tar.Z
- -rw-r--r-- 374418 Aug 2 03:15 german.tar.Z
- -rw-r--r-- 68797 Aug 2 03:13 italian.tar.Z
- -rw-r--r-- 123145 Aug 2 03:18 norwegian.tar.Z
- -rw-r--r-- 61743 Aug 2 03:17 swedish.tar.Z
-
- To recover the wordlists, copy the "unpack-words" script
- and any subset of the archives above to a clean working directory;
- then do "unpack-words <language>" for each language.
-
- The script should uncompress and un-tar the archive,
- compile the 'expanddict.c" filter (included in archive),
- and pipe each wordlist through it.
-
- I would greatly appreciate any additions corrections to the
- wordlists or to the accompanying documentation, and any pointers to
- additional publicly available wordlists.
-
- --jorge
-
-