Next: 6. How Aspell Works Up: Aspell .26.2 alpha A Previous: 4. Library Interface   Contents

5. International Support

Even though Aspell is designed around the English language, it will do OK with other languages provided that the language doesn't have an extremely large dictionary (say, over a megabyte or two in size) or a lot of affixation (to the point where affix compression would shrink the size by over 50%). If the language has a large dictionary or a lot of affixation Aspell will still work, but it will take up a lot of space due to the way Aspell indexes the words (see 6) and the fact that Aspell currently lacks any sort of affix compression (see B.7.1).

Support for other languages can be added either at run time through a language data file or at compile time.

5.1 At Run Time

Languages can be added at will through the use of a language data file. The file must be in the same directory as the word list(s) and it must be named <language>.lang where <language> is the name of the language you are adding support for.

5.1.1 Format of the language data files

The data file consists of three blocks of information enclosed in braces. Any information outside of the braces is ignored. The white space before and after the braces is mandatory.

5.1.1.1 The Case Block

The first block of information contains the upper to lower case mapping. It consists of lower/upper case pairs of letters with white space between the pairs. For example, here is the case mapping for English:

{ aA bB cC dD eE fF gG hH iI jJ kK lL mM nN oO pP qQ rR sS tT uU vV wW xX yY zZ }
If a character is the same in both upper and lower case, then repeat it twice, such as "kk". Failure to do so will result in an error. As noted above, the white space before and after the braces ({ }) is mandatory.

5.1.1.2 The Vowel Block

The second block of information contains a list of the vowels or vowel-like characters in lower case. For example, the second block for English would be:

{ a e i o u y }

5.1.1.3 The Other Characters Block

The last block of information contains a list of other characters which are not part of the alphabet but can nevertheless appear within a valid word. The final block for English would be:

{ ' }

5.1.2 A complete example

For your reference, here is what the complete english.lang file looks like:

Language File for english

Case Block { aA bB cC dD eE fF gG hH iI jJ kK lL mM nN oO pP qQ rR sS tT uU vV wW xX yY zZ }

Vowel Block { a e i o u y }

Other Characters Block { ' }

5.1.3 Finishing Up

Once you have created the data file you need to pass the dictionary through "aspell master" to properly prepare it using the new language. Then just make sure the word list and the language data file are in the same directory.

Once you have used the new language for a while please consider sending me a copy of the data file so that I can include it in future versions.

5.2 At Compile Time

More complete support for a language can be added by writing some code and recompiling the source. In order to do this you should have the latest versions of automake, autoconf, and libtool installed, as the Makefile is going to need to be recreated.

  
5.2.1 Getting Started

The easiest way to get started is to write the language data file first and use the aspell utility to create most of the code for you. The usage is

aspell lang [<path>]<lang>
Where <path> is the optional fully qualified directory name of the location of the language data file and <lang> is the name of the language. There should be no space between <path> and <lang>.

This will create two files, asl_<lang>.hh and asl_<lang>.cc, containing all the code you need to compile in support for the language. However, in order to get Aspell to recognize the new language you need to modify the file language.cc in two places: you need to include the asl_<lang>.hh file and you need to add the language to the lookup static variable. The line to add should look like this:

lookup_pair("<lang>", new_SC_<Lang>)
Where <Lang> is <lang> with the first letter capitalized. For example, to add support for the French language you would say:

lookup_pair("french", new_SC_French)
This line can go anywhere in the table; however, I recommend that you add it after the last entry. Just be sure that the list still has all the necessary commas.
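
To show the kind of table edit being described, here is a small stand-alone sketch. It is not the contents of language.cc: the SC_Language_Stub type, the factory_fn typedef, the find_language function and the layout of lookup_pair are all assumptions made up so that the example compiles on its own. Only the shape of the table entries is taken from the text above.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-ins; the real declarations live in the Aspell source.
struct SC_Language_Stub { const char *name; };
typedef SC_Language_Stub *(*factory_fn)();

struct lookup_pair {
  const char *name;
  factory_fn create;
  lookup_pair(const char *n, factory_fn f) : name(n), create(f) {}
};

SC_Language_Stub *new_SC_English() {
  static SC_Language_Stub lang = {"english"};
  return &lang;
}
SC_Language_Stub *new_SC_French() {
  static SC_Language_Stub lang = {"french"};
  return &lang;
}

static const lookup_pair lookup[] = {
  lookup_pair("english", new_SC_English),  // existing entry, now needs its comma
  lookup_pair("french", new_SC_French)     // new entry added after the last one
};

// Returns the factory registered under name, or 0 if there is none.
factory_fn find_language(const char *name) {
  for (std::size_t i = 0; i != sizeof(lookup) / sizeof(lookup[0]); ++i)
    if (std::strcmp(lookup[i].name, name) == 0) return lookup[i].create;
  return 0;
}
```

Forgetting the comma after the entry that used to be last is the usual mistake here, and the compiler will complain about it.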

Finally, you need to add the file asl_<lang>.cc to the end of the libspell_la_SOURCES variable in Makefile.am and then type make. All the necessary files will be recreated automatically provided that you have the proper tools installed.

Once you have successfully used the compiled-in language you can start experimenting with fine-tuning it by overriding virtual methods in the SC_Language class.

5.2.2 The SC_Language Class

The SC_Language class is the base class for language support. All language classes must be derived from this class.

5.2.2.1 Synopsis

class SC_Language {  
protected:  
  enum CasePattern {all_lower, first_upper, all_upper};  
  static const char consonant = 1, vowel = 2, special = 3; 
   
  const char *name_; 
  const char *to_lower_;  
  const char *to_upper_;  
  const char *is_alpha_;  
  const char *soundslike_chars_;  
   
  SC_Language() {} 
  
public:  
  virtual ~SC_Language() {}  
  
  virtual string to_soundslike(const const_string &word) const;  
  virtual int case_pattern (const const_string &word) const;  
  virtual string fix_case (int pattern, const const_string &word) const;  
  virtual bool trim_n_try (const aspell &sc, const const_string &word) const; 
  virtual bool have_phoneme() const; 
  virtual string to_phoneme(const const_string &word) const; 
  
private: 
  SC_Language(const SC_Language&);  
  const SC_Language operator= (const SC_Language&); 
  
public:  
  char to_upper(const char c) const;
  char to_lower(const char c) const;
  bool is_upper(const char c) const;
  bool is_lower(const char c) const;
  bool is_special(const char c) const;

  // other irrelevant non virtual methods 
};

5.2.2.2 The protected members

All of the protected data members must be given a value by the derived class as the public methods rely on them. The "aspell lang" utility will take care of this for you, so in most cases you don't need to worry about them.

5.2.2.2.1 const char *name_

This data member needs to point to a null terminated string containing the name of the current language.

5.2.2.2.2 const char *to_lower_

This data member needs to point to a 256 character long character array which maps the upper case characters to lower case. A static_cast<unsigned char> is performed on the character before it is looked up, so that a signed value of -1 becomes 255 (assuming 8-bit characters). If the character c is an upper case character then to_lower_[static_cast<unsigned char>(c)] needs to contain c in lower case. If c is not in upper case then it needs to contain c.
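
As an illustration, a table of this kind could be filled in from the lower/upper pairs of the language data file roughly as follows. The functions build_to_lower and lookup_lower are made up for this sketch and are not part of Aspell. An 8-bit character set is assumed.

```cpp
#include <cstddef>

// Hypothetical helper: fills a 256 entry table from lower/upper pairs such
// as "aA". Characters not named as the upper half of a pair map to themselves.
void build_to_lower(const char *const *pairs, std::size_t n, char table[256]) {
  for (int i = 0; i != 256; ++i)
    table[i] = static_cast<char>(i);  // identity by default
  for (std::size_t k = 0; k != n; ++k) {
    const char lower = pairs[k][0];
    const char upper = pairs[k][1];
    table[static_cast<unsigned char>(upper)] = lower;
  }
}

// Lookup as described in the text: cast to unsigned char before indexing so
// that characters with the high bit set never produce a negative index.
char lookup_lower(const char *table, char c) {
  return table[static_cast<unsigned char>(c)];
}
```

With the pairs "aA" and "zZ", lookup_lower maps 'A' to 'a' and leaves 'a' and '!' alone. A pair made of two high-bit characters works the same way because of the cast.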

5.2.2.2.3 const char *to_upper_

The same as to_lower_ but it maps lower case characters to upper case.

5.2.2.2.4 const char *is_alpha_

Similar to to_lower_ and to_upper_ except that is_alpha_[static_cast<unsigned char>(c)] needs to be false (0) if c is a non-word character and true (anything but 0) otherwise.

In addition, if the to_soundslike method is not overridden, is_alpha_[static_cast<unsigned char>(c)] needs to be SC_Language::consonant if c is a consonant and SC_Language::vowel if c is a vowel. If the trim_n_try method is not overridden, it needs to be SC_Language::special if c is a non-alphabetic character that can appear as part of a word, such as the apostrophe (') in English.
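
Here is a sketch of how such a classification table might be built. The helper build_is_alpha and the constants kConsonant, kVowel and kSpecial are invented for the example; the constant values mirror the ones shown in the synopsis. Marking an upper case letter as a vowel when its lower case form is a vowel is my assumption. An 8-bit character set is assumed as well.

```cpp
#include <cstddef>
#include <cstring>

// Same values as consonant, vowel and special in the synopsis.
const char kConsonant = 1, kVowel = 2, kSpecial = 3;

// Hypothetical helper. lower and upper must have the same length, with
// upper[i] being the upper case form of lower[i].
void build_is_alpha(const char *lower, const char *upper,
                    const char *vowels, const char *specials,
                    char table[256]) {
  std::memset(table, 0, 256);  // 0 means "not a word character"
  for (std::size_t i = 0; lower[i] != '\0'; ++i) {
    // A letter counts as a vowel if its lower case form is in the vowel list.
    const char kind = std::strchr(vowels, lower[i]) ? kVowel : kConsonant;
    table[static_cast<unsigned char>(lower[i])] = kind;
    table[static_cast<unsigned char>(upper[i])] = kind;
  }
  for (std::size_t i = 0; specials[i] != '\0'; ++i)
    table[static_cast<unsigned char>(specials[i])] = kSpecial;
}
```

Since every word character gets a non-zero value, the same table still answers the plain "is this a word character" question.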

5.2.2.2.5 const char *soundslike_chars_

Needs to contain a null terminated array of characters which contains all of the characters that can appear in a to_soundslike string. If the to_soundslike method is not overridden this will be all the lower case consonants.

5.2.2.3 Virtual Destructor

The destructor must be defined if your class uses any dynamically allocated memory as the SC_Language class destructor does not delete anything.

5.2.2.4 Virtual Public Members

These methods only have to be overridden if you are unhappy with the job they do. const_string is a very limited version of the string class. It has an iterator and can be used like a random access container; however, it doesn't have any of the fancy string methods such as find and substr.

5.2.2.4.1 string to_soundslike(const const_string &word) const

This method needs to return a string which represents what the word roughly sounds like.
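
To give an idea of what such a method can look like, here is one very simple possibility: lower-case the word and keep only the consonants. This is not a description of what Aspell's default does. It takes std::string rather than const_string, and the vowels parameter is an assumption made so the example stands alone.

```cpp
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

// Illustrative only: returns the lower-cased consonants of word, in order.
std::string to_soundslike(const std::string &word,
                          const char *vowels = "aeiouy") {
  std::string out;
  for (std::size_t i = 0; i != word.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(word[i]);
    if (!std::isalpha(c)) continue;             // skip characters such as '
    const char lower = static_cast<char>(std::tolower(c));
    if (std::strchr(vowels, lower)) continue;   // skip vowels
    out += lower;
  }
  return out;
}
```

With this version "receive" and the misspelling "recieve" both come out as "rcv", which is the kind of property a soundslike string is meant to have.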

5.2.2.4.2 string to_phoneme(const const_string &word) const

This method needs to return a string which represents the phoneme for the word.

5.2.2.4.3 bool have_phoneme() const

Needs to return true if the to_phoneme method is overridden.

5.2.2.4.4 int case_pattern (const const_string &word) const

This method needs to study the string and return an integer which represents the case pattern (such as all upper case, first letter upper case, etc.).

5.2.2.4.5 string fix_case (int pattern, const const_string &word) const

This method needs to fix the case of word so that it has the same case pattern as pattern and return the new word.
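
The following sketch shows one way this pair of methods could behave, using the three patterns from the CasePattern enum in the synopsis. It uses std::string and the <cctype> functions instead of const_string and the language tables, so it is only a stand-in. Treating any other mixture of cases as all_lower or first_upper, and leaving the word untouched for all_lower, are my assumptions.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum CasePattern { all_lower, first_upper, all_upper };

// Classifies word by looking at the first character and then the rest.
int case_pattern(const std::string &word) {
  if (word.empty() || !std::isupper(static_cast<unsigned char>(word[0])))
    return all_lower;
  bool rest_upper = word.size() > 1;  // a single capital counts as first_upper
  for (std::size_t i = 1; i != word.size(); ++i)
    if (std::islower(static_cast<unsigned char>(word[i]))) rest_upper = false;
  return rest_upper ? all_upper : first_upper;
}

// Returns a copy of word with the given pattern applied.
std::string fix_case(int pattern, const std::string &word) {
  std::string out(word);
  for (std::size_t i = 0; i != out.size(); ++i) {
    if (pattern == all_upper || (pattern == first_upper && i == 0))
      out[i] = static_cast<char>(
          std::toupper(static_cast<unsigned char>(out[i])));
  }
  return out;
}
```

The idea is that the pattern is taken from the word as typed and then applied to a candidate word, so fix_case(case_pattern("Wrold"), "world") gives "World".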

5.2.2.4.6 bool trim_n_try (const aspell &sc, const const_string &word) const

This method should try to trim special characters (such as the apostrophe in English) from the word and then see if it is a valid word. If it can find a valid word by trimming it should return true. Otherwise it should return false.

To avoid infinite recursion this method should not call aspell::check, as aspell::check calls this method. Use aspell::check_notrim instead. (aspell::check_raw should not be used as it does not try to change the case of the word, thus 'Do' would come back false.)
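
A minimal sketch of the trimming logic might look like the following. The Checker interface is a made-up stand-in for the aspell class with only a check_notrim method, whose signature here is my guess. The is_special function assumes the apostrophe is the only special character, and std::string is used in place of const_string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the speller; only the call the text recommends.
struct Checker {
  virtual ~Checker() {}
  virtual bool check_notrim(const std::string &word) const = 0;
};

// Assumption: the apostrophe is the only special character.
inline bool is_special(char c) { return c == '\''; }

// Strips special characters from both ends and checks what is left.
bool trim_n_try(const Checker &sc, const std::string &word) {
  std::size_t begin = 0, end = word.size();
  while (begin < end && is_special(word[begin])) ++begin;
  while (end > begin && is_special(word[end - 1])) --end;
  if (begin == 0 && end == word.size()) return false;  // nothing was trimmed
  if (begin == end) return false;                      // nothing is left
  // check_notrim, not check, so that this cannot recurse.
  return sc.check_notrim(word.substr(begin, end - begin));
}
```

A more thorough version could also try trimming only one end at a time. This one trims both ends at once to keep the example short.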

5.2.2.5 Private Members

Both the copy constructor and the assignment operator are private so that you don't have to worry about copies being made.

5.2.2.6 Public Members



1999-01-04