Opened 5 years ago

Closed 5 years ago

#2 closed enhancement (fixed)

Generate Data for Initial Languages

Reported by: rsmudge
Owned by: rsmudge
Priority: major
Milestone: Multi-Lingual AtD
Component: AtD Server
Keywords:
Cc:

Description

Generate corpus data, models, and dictionary words for the initial set of languages Multi-Lingual AtD will support.

The process for the corpus data is documented at:

http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/

Dictionary wordlists can be derived from OO.org dictionaries using the unmunch utility.
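
As a rough sketch of the unmunch step: unmunch ships with Hunspell and takes a .dic file followed by its .aff file, printing the expanded word forms to standard output. The helper name make_wordlist and the file names in the usage note are hypothetical, not part of AtD.

```shell
# Hypothetical helper (not part of AtD): expand an OO.org/Hunspell
# dictionary into a plain, de-duplicated wordlist.
# Assumes the unmunch utility from Hunspell is on PATH.
make_wordlist() {
    dic="$1"   # e.g. fr.dic (illustrative name)
    aff="$2"   # e.g. fr.aff (illustrative name)
    out="$3"   # destination wordlist file

    # unmunch prints one expanded word form per line;
    # sort -u removes the duplicates the expansion can produce.
    unmunch "$dic" "$aff" | sort -u > "$out"
}
```

It might be called as `make_wordlist fr.dic fr.aff frwords.txt`. If the dictionary is not UTF-8 (the encoding is declared on the SET line of the .aff file), the output probably needs converting, for example with iconv, since the build commands in this ticket run with -Dfile.encoding=UTF-8.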

To generate the models, first create the directory layout for the language (LANG_ID is the language code, e.g. fr):

mkdir atd/lang/LANG_ID
mkdir atd/lang/LANG_ID/models
mkdir atd/lang/LANG_ID/corpus
mkdir atd/lang/LANG_ID/wordlists

Place the corpus data and wordlists into the appropriate directories.
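
The directory setup can also be sketched as a small loop. The ATD_ROOT variable and the default of fr are assumptions for illustration; mkdir -p is used so the parent directories are created if they do not exist yet.

```shell
# LANG_ID is the language code; "fr" is only a default for illustration.
LANG_ID="${LANG_ID:-fr}"
# ATD_ROOT is an assumed variable pointing at the AtD checkout.
ATD_ROOT="${ATD_ROOT:-atd}"

# Create models/, corpus/ and wordlists/ under lang/LANG_ID.
for dir in models corpus wordlists; do
    mkdir -p "$ATD_ROOT/lang/$LANG_ID/$dir"
done
```

The build commands in this ticket use paths starting at lang/, so they are presumably run from inside the atd directory.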

Optionally, create unigrams for the entire language and extract the unigrams occurring 50 or more times. Add this data to the wordlists folder. This can help fill the dictionary out with more words and capture proper nouns:

java -Datd.lang=fr -Dfile.encoding=UTF-8 -Xmx3840M -XX:+AggressiveHeap -XX:+UseParallelGC -jar lib/sleep.jar utils/bigrams/buildunigrams.sl lang/fr/corpus lang/fr/models/unigrams.bin

java -Dfile.encoding=UTF-8 -Xmx2536M -XX:NewSize=512M -jar lib/sleep.jar utils/bigrams/builddict.sl 25 lang/fr/models/unigrams.bin lang/fr/wordlists/frdict.txt

Finally, build the AtD models:

rm -f lang/fr/models/model.bin
java -Datd.lang=fr -Dfile.encoding=UTF-8 -Xmx3840M -XX:+AggressiveHeap -XX:+UseParallelGC -jar lib/sleep.jar utils/bigrams/buildcorpus.sl lang/fr/corpus lang/fr/models/model.bin lang/fr/wordlists

java -Dfile.encoding=UTF-8 -Xmx2536M -XX:NewSize=512M -jar lib/sleep.jar utils/bigrams/builddict.sl 2 lang/fr/models/model.bin lang/fr/models/dictionary.txt
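
Since the commands above hard-code fr, one way to reuse them for other languages is a dry-run helper that prints the same sequence for a given language ID. This is only a sketch: the function name atd_build_commands is hypothetical, the JVM flags and the thresholds 25 and 2 are copied from the commands above, and the wordlist name LANGdict.txt is assumed to follow the frdict.txt pattern.

```shell
# Hypothetical helper: print (do not run) the build commands for one language.
atd_build_commands() {
    lang="$1"

    # JVM invocations copied from the ticket: a large-heap one for the
    # corpus passes and a smaller one for builddict.sl.
    big="java -Datd.lang=$lang -Dfile.encoding=UTF-8 -Xmx3840M -XX:+AggressiveHeap -XX:+UseParallelGC -jar lib/sleep.jar"
    small="java -Dfile.encoding=UTF-8 -Xmx2536M -XX:NewSize=512M -jar lib/sleep.jar"

    # Optional step: unigrams, then frequent unigrams into the wordlists folder.
    echo "$big utils/bigrams/buildunigrams.sl lang/$lang/corpus lang/$lang/models/unigrams.bin"
    echo "$small utils/bigrams/builddict.sl 25 lang/$lang/models/unigrams.bin lang/$lang/wordlists/${lang}dict.txt"

    # Final step: rebuild the model and derive the dictionary from it.
    echo "rm -f lang/$lang/models/model.bin"
    echo "$big utils/bigrams/buildcorpus.sl lang/$lang/corpus lang/$lang/models/model.bin lang/$lang/wordlists"
    echo "$small utils/bigrams/builddict.sl 2 lang/$lang/models/model.bin lang/$lang/models/dictionary.txt"
}

# Example: show the commands for German.
atd_build_commands de
```

Printing the commands first makes it easy to review the paths before a long-running build; the output could then be piped to sh from the directory that contains lib/ and lang/.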

Change History (1)

comment:1 Changed 5 years ago by rsmudge

  • Resolution set to fixed
  • Status changed from new to closed