Opened 4 years ago
Closed 3 years ago
#2 closed enhancement (fixed)
Generate Data for Initial Languages
| Reported by: | rsmudge | Owned by: | rsmudge |
|---|---|---|---|
| Priority: | major | Milestone: | Multi-Lingual AtD |
| Component: | AtD Server | Keywords: | |
| Cc: |
Description
Generate corpus data, models, and dictionary words for the initial set of languages Multi-Lingual AtD will support.
The process for the corpus data is documented at:
http://blog.afterthedeadline.com/2009/12/04/generating-a-plain-text-corpus-from-wikipedia/
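Dictionary wordlists can be derived from OpenOffice.org (OO.org) dictionaries using the unmunch utility that ships with Hunspell, which expands each stem and its affix flags into full word forms.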
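As a rough illustration of the unmunch step, the sketch below expands one dictionary into a de-duplicated wordlist. It assumes unmunch is on the PATH and takes the .dic file followed by the .aff file; the function name expand_dict and the file names in the example are hypothetical.

```shell
# Sketch: expand a Hunspell-style dictionary into a plain wordlist.
# Assumption: `unmunch` is installed and invoked as `unmunch DIC AFF`.
expand_dict() {
    dic="$1"   # e.g. fr.dic (hypothetical name)
    aff="$2"   # e.g. fr.aff (hypothetical name)
    out="$3"   # destination wordlist
    # unmunch prints one expanded word per line; sort -u removes duplicates.
    unmunch "$dic" "$aff" | sort -u > "$out"
}

# Example (not run here):
# expand_dict fr.dic fr.aff lang/fr/wordlists/frdict.txt
```

OO.org dictionaries are not always UTF-8; the encoding is declared on the SET line of the .aff file, so the output may need converting (for example with iconv) before AtD reads it with -Dfile.encoding=UTF-8.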
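To generate models, first create the directory layout for the language, where LANG_ID is the language code (fr in the examples below):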
mkdir atd/lang/LANG_ID
mkdir atd/lang/LANG_ID/models
mkdir atd/lang/LANG_ID/corpus
mkdir atd/lang/LANG_ID/wordlists
Place the corpus data and wordlists into the appropriate directories.
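The directory layout above can also be created in one call per language. This is a minimal sketch; the function name make_lang_dirs is hypothetical, and the root argument is assumed to be the atd checkout directory.

```shell
# Sketch: create the models/corpus/wordlists layout for one language.
make_lang_dirs() {
    root="$1"      # path to the atd checkout (assumption)
    lang_id="$2"   # language code, e.g. fr
    for sub in models corpus wordlists; do
        # -p creates missing parent directories and is a no-op if they exist.
        mkdir -p "$root/lang/$lang_id/$sub"
    done
}

# Example (not run here):
# make_lang_dirs atd fr
```

Using mkdir -p makes the step safe to repeat when more languages are added later.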
Optionally, create unigrams for the entire language and extract the unigrams occurring 50 or more times. Add this data to the wordlists folder. This can help fill the dictionary out with more words and capture proper nouns:
java -Datd.lang=fr -Dfile.encoding=UTF-8 -Xmx3840M -XX:+AggressiveHeap -XX:+UseParallelGC -jar lib/sleep.jar utils/bigrams/buildunigrams.sl lang/fr/corpus lang/fr/models/unigrams.bin
java -Dfile.encoding=UTF-8 -Xmx2536M -XX:NewSize=512M -jar lib/sleep.jar utils/bigrams/builddict.sl 25 lang/fr/models/unigrams.bin lang/fr/wordlists/frdict.txt
Finally, build the AtD models:
rm -f lang/fr/models/model.bin
java -Datd.lang=fr -Dfile.encoding=UTF-8 -Xmx3840M -XX:+AggressiveHeap -XX:+UseParallelGC -jar lib/sleep.jar utils/bigrams/buildcorpus.sl lang/fr/corpus lang/fr/models/model.bin lang/fr/wordlists
java -Dfile.encoding=UTF-8 -Xmx2536M -XX:NewSize=512M -jar lib/sleep.jar utils/bigrams/builddict.sl 2 lang/fr/models/model.bin lang/fr/models/dictionary.txt
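Since the same commands are repeated for every language, the corpus-build step could be parameterized by language code. The sketch below only prints the command for each language so it can be reviewed before running; the function name build_model_cmd and the language list are hypothetical, and the JVM flags are copied from the commands above.

```shell
# Sketch: print the buildcorpus command for a given language code.
# It does not run java; pipe the output to sh to execute it.
build_model_cmd() {
    lang_id="$1"
    echo "java -Datd.lang=$lang_id -Dfile.encoding=UTF-8 -Xmx3840M" \
         "-XX:+AggressiveHeap -XX:+UseParallelGC -jar lib/sleep.jar" \
         "utils/bigrams/buildcorpus.sl lang/$lang_id/corpus" \
         "lang/$lang_id/models/model.bin lang/$lang_id/wordlists"
}

# Hypothetical language list; substitute the languages actually being built.
for lang_id in fr de; do
    build_model_cmd "$lang_id"
done
```

Printing rather than executing keeps the sketch harmless on a machine without the AtD checkout or the multi-gigabyte heap the build requests.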
Change History (1)
comment:1 Changed 3 years ago by rsmudge
- Resolution set to fixed
- Status changed from new to closed
