Friday, April 12, 2019

How to make Stardict dictionaries from tatoeba.org


Tatoeba.org is a collection of sentences and their translations. Suppose you know Russian and English, and another contributor knows English and French. Given an English sentence, you can translate it into Russian and the other contributor can translate it into French. Anyone who then wants a translation of that sentence from French to Russian (or from Russian to French) will find it, without needing to know English. At the time of writing, tatoeba.org has more than 4 million sentences, of which 594 thousand are English. Using the tatoeba.org site directly has the following problems:
  1. Sometimes you do not have access to the internet.
  2. Even with internet access, a single HTTP request may sometimes take 20 seconds, which is a very long time.
  3. The tatoeba.org site searches the tree of linked translations only up to a limited group level. Thus, these dictionaries will contain more sentences than the site shows.
Therefore, I have written several programs that help convert the CSV files (from here) into the StarDict dictionary format. I did all of this on Arch Linux.

First: build or download SQLite3 database with sentences

A dedicated database is needed because it contains indexes, which make working with the sentences faster by orders of magnitude. To build the database, use SQLToebaLite. Alternatively, you can download a prepared database from here. Both tatoeba.db and tatoebaLite.db work, but tatoebaLite.db is smaller than tatoeba.db. Please download tatoebaLite.db, because my hosting has limited bandwidth :).
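To illustrate why the indexes matter, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and index names are assumptions made for this example only; the actual schema produced by SQLToebaLite may differ.

```python
import sqlite3

# Hypothetical schema for illustration; the real tatoeba.db layout may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, lang TEXT, text TEXT);
    CREATE TABLE links (sentence_id INTEGER, translation_id INTEGER);
    -- Indexes let SQLite jump to matching rows instead of scanning whole tables.
    CREATE INDEX idx_sentences_lang ON sentences (lang);
    CREATE INDEX idx_links_sentence ON links (sentence_id);
""")
conn.executemany(
    "INSERT INTO sentences VALUES (?, ?, ?)",
    [(1, "eng", "I am hungry."), (2, "deu", "Ich habe Hunger.")],
)
conn.executemany("INSERT INTO links VALUES (?, ?)", [(1, 2), (2, 1)])

# Ask SQLite how it would look up the translations of sentence 1.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT translation_id FROM links WHERE sentence_id = ?",
    (1,),
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)

# Fetch the German translations of sentence 1.
rows = conn.execute(
    "SELECT s.text FROM links l JOIN sentences s ON s.id = l.translation_id "
    "WHERE l.sentence_id = ? AND s.lang = ?",
    (1, "deu"),
).fetchall()
print(rows)
```

The query plan should mention the index on links, which is what keeps lookups fast when the table holds millions of rows.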

Second: make tab file

A tab file is a file with two columns separated by a tab character. The first column is the original sentence; the second is its translation. To make a tab file, use tatoebaToStarDict (github). Check the line in the script that looks like conn = sqlite3.connect('tatoeba.db').

The folder containing this script must contain a file named tatoeba.db (edit this string in the script or rename the file if the names differ). Usage of tatoebaToStarDict.py:
  1. python3 tatoebaToStarDict.py 5 eng deu > TatoebaEngToGer201605
This command creates the tab file TatoebaEngToGer201605, which contains German translations of English sentences. The number ‘5’ is the maximum group level of the search (the tatoeba.org site searches up to group level 2), so I recommend using a maximum group level of 5. The program takes into account that a sentence may have several translations in another language. You can find the short language codes (‘eng’, ‘deu’, for example) here.
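The idea of a maximum group level can be sketched as a breadth-first search over translation links, where level 1 means direct translations and level 2 means translations of translations. This is not the code of tatoebaToStarDict.py; the function names, the in-memory data and the way several translations are joined with "; " are assumptions made for illustration.

```python
from collections import deque

def find_translations(sentences, links, start_id, target_lang, max_level):
    """Return texts in target_lang reachable from start_id within max_level links."""
    # Treat links as undirected: a translation works in both directions.
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    seen = {start_id}
    queue = deque([(start_id, 0)])
    found = []
    while queue:
        sentence_id, level = queue.popleft()
        if level == max_level:
            continue  # do not follow links beyond the allowed group level
        for next_id in sorted(neighbours.get(sentence_id, ())):
            if next_id in seen:
                continue
            seen.add(next_id)
            lang, text = sentences[next_id]
            if lang == target_lang:
                found.append(text)
            queue.append((next_id, level + 1))
    return found

def tab_line(source_text, translations):
    """Format one tab-file line: original, a tab character, then the translations."""
    return source_text + "\t" + "; ".join(translations)

# Illustrative data: English and German are linked only through French.
sentences = {
    1: ("eng", "I am hungry."),
    2: ("fra", "J'ai faim."),
    3: ("deu", "Ich habe Hunger."),
}
links = [(1, 2), (2, 3)]

level_1 = find_translations(sentences, links, 1, "deu", 1)
level_2 = find_translations(sentences, links, 1, "deu", 2)
line = tab_line(sentences[1][1], level_2)
print(level_1, level_2)
print(line)
```

With a maximum level of 1 the German sentence is not found, because it is linked to the English one only through French; with level 2 it is. A higher maximum level therefore yields more translation pairs.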

Third: compile Stardict dictionary using stardict-editor

I downloaded the stardict-tools package from here. After installing the package, run stardict-editor. Open the tab file, then press the “compile” button.

This creates three files (.dict, .idx, .ifo), which can be opened with StarDict or GoldenDict.
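As a quick sanity check of the result, the .ifo file can be inspected: it is a small text file whose first line is a fixed header, followed by key=value pairs such as bookname and wordcount. The sketch below parses such text; the function name parse_ifo is hypothetical, and the sample values are made up for illustration rather than taken from a real compiled dictionary.

```python
def parse_ifo(text):
    """Parse the text of a StarDict .ifo file into a dict of its key=value fields."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "StarDict's dict ifo file":
        raise ValueError("not a StarDict .ifo file")
    info = {}
    for line in lines[1:]:
        if "=" in line:
            key, value = line.split("=", 1)
            info[key.strip()] = value.strip()
    return info

# Illustrative content only; a real file is written by stardict-editor.
sample_ifo = """StarDict's dict ifo file
version=2.4.2
wordcount=3
idxfilesize=57
bookname=TatoebaEngToGer201605
sametypesequence=m
"""

info = parse_ifo(sample_ifo)
print(info["bookname"], info["wordcount"])
```

For a real dictionary, read the text with open(path, encoding="utf-8").read() and compare wordcount with the number of lines in the tab file.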

Result

As a result, translations from Tatoeba sentences can be searched faster than on the site (in less than 1 second), on a computer or on a mobile phone.

You can find the compiled dictionaries here. If you do not want to download or build the database just to create two or three dictionaries, you can ask me; from time to time I will share compiled dictionaries. If you have a question, you can post a comment.
