Tatoeba.org is a free collaborative online database of example sentences geared towards foreign language learners. Its name comes from the Japanese term "tatoeba" (例えば, tatoeba), meaning "for example". Unlike other online dictionaries, which focus on words, Tatoeba focuses on translation of complete sentences. In addition, the structure of the database and interface emphasize one-to-many relationships. Not only can a sentence have multiple translations within a single language, but its translations into all languages are readily visible, as are indirect translations that involve a chain of stepwise links from one language to another.
The aim of the project
The aim of the Tatoeba Project is to create a database of sentences and translations that can be used by anyone developing a language learning application. The idea is that the project creates the data, so programmers can just focus on coding the application.
The data collected by the project is freely available under a Creative Commons Attribution (CC-BY) license.
Content
As of November 2017, the Tatoeba Corpus has over 6,000,000 sentences in 319 languages. Eighty-five languages have over 1,000 sentences each, and the top 13 have over 100,000 sentences each. The top 21 languages make up 90% of the corpus. The interface is available in 25 different languages.
Tatoeba.org is also the current home of the Tanaka Corpus, a public-domain collection of about 150,000 English-Japanese sentence pairs compiled by Hyogo University professor Yasuhito Tanaka and first released in 2001. The corpus is undergoing its latest revisions at Tatoeba.
Current statistics for all languages can be found at [1].
History
Tatoeba was founded by Trang Ho in 2006. She originally hosted the project on SourceForge under the project name "multilangdict".
Interface
Users, even those who are not registered, can search for words in any language to retrieve sentences that use them. Each sentence in the Tatoeba database is displayed next to its translations in other languages; direct and indirect translations are differentiated. Sentences are tagged for content such as subject matter, dialect, or vulgarity. Each sentence also has its own comment thread, where other users can leave feedback, corrections, and cultural notes. As of early 2016, more than 200,000 sentences in 19 languages had audio readings. Sentences can also be browsed by language, tag, or audio.
Registered users can add new sentences or translate or proofread existing ones, even if their target language is not their native tongue. However, it is preferred that users translate into their native or "strongest" language and add sentences from their native language rather than translating into or adding from their target language. Translations are linked to the original sentence automatically. Users can freely edit their own sentences, "adopt" and correct sentences without an owner, and comment on others' sentences. Advanced contributors, a rank above ordinary contributors, can tag, link, and unlink sentences. Corpus maintainers, a rank above advanced contributors, can untag and delete sentences. They can also modify owned sentences, though they typically do so only if the owner fails to respond to a request to make the change.
Database structure
Tatoeba's basic data structure is a series of nodes and links. Each sentence is a node; each link bridges two sentences with the same meaning.
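The node-and-link structure described above can be illustrated with a minimal sketch. This is not Tatoeba's actual schema or code: the class name SentenceGraph, its methods, and the sample sentence IDs are hypothetical, and "indirect translation" is assumed here to mean a sentence reachable through exactly one intermediate sentence.

```python
from collections import defaultdict


class SentenceGraph:
    """Hypothetical in-memory model: sentences are nodes, links are undirected edges."""

    def __init__(self):
        self.sentences = {}            # sentence id -> (language code, text)
        self.links = defaultdict(set)  # sentence id -> ids of linked sentences

    def add_sentence(self, sid, lang, text):
        self.sentences[sid] = (lang, text)

    def link(self, a, b):
        # A link joins two sentences with the same meaning, so store it both ways.
        self.links[a].add(b)
        self.links[b].add(a)

    def direct_translations(self, sid):
        return set(self.links.get(sid, ()))

    def indirect_translations(self, sid):
        # Assumption: "indirect" means two links away and not already direct.
        direct = self.direct_translations(sid)
        two_hop = set()
        for neighbour in direct:
            two_hop |= self.links.get(neighbour, set())
        return two_hop - direct - {sid}


graph = SentenceGraph()
graph.add_sentence(1, "eng", "For example.")
graph.add_sentence(2, "jpn", "例えば。")
graph.add_sentence(3, "fra", "Par exemple.")
graph.link(1, 2)  # English <-> Japanese
graph.link(2, 3)  # Japanese <-> French

print(graph.direct_translations(1))    # sentences linked directly to sentence 1
print(graph.indirect_translations(1))  # sentences reached via one intermediate sentence
```

In this sketch the French sentence is an indirect translation of the English one because the two are connected only through the Japanese sentence.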
License
The entire Tatoeba database is published under a Creative Commons Attribution 2.0 license, freeing it for academic and other use.
Grants
Tatoeba received a grant from Mozilla Drumbeat in December 2010.
Some work on the Tatoeba infrastructure was sponsored by Google Summer of Code.
Usage
Parallel text corpora such as Tatoeba are used for a variety of natural language processing tasks such as machine translation. Tatoeba data has been used for treebanking Japanese and for statistical machine translation, as well as in the WWWJDIC Japanese-English dictionary and in the Bilingual Sentence Pairs and Japanese Reading and Translation Practice on www.ManyThings.org.
Offline edition
Selected content from Tatoeba (83,932 phrases in Esperanto along with all their translations into other languages) appeared in the third edition of the multilingual DVD Esperanto Elektronike ("Electronic Esperanto"), published in 6,000 copies by E@I in July 2011.
Tab-delimited data ready for import into Anki and similar software can be downloaded from http://www.manythings.org/anki/
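As a rough illustration of working with tab-delimited sentence pairs, the following sketch reads such data with Python's standard csv module. The exact column layout of the downloadable files is not described here, so the code assumes that the first two tab-separated fields of each line are a sentence and its translation; the function name read_pairs and the sample data are hypothetical.

```python
import csv
import io


def read_pairs(stream):
    """Return (sentence, translation) tuples from tab-delimited text.

    Assumption: the first two fields on each line are the sentence and its
    translation; any further fields are ignored.
    """
    pairs = []
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        pairs.append((row[0], row[1]))
    return pairs


# Made-up sample standing in for a downloaded file.
sample = io.StringIO("Hello.\tBonjour.\nThank you.\tMerci.\n\n")
pairs = read_pairs(sample)
for sentence, translation in pairs:
    print(sentence, "->", translation)
```

For a real file, the same function could be given a file object opened with open(path, encoding="utf-8", newline="").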
See also
- Phrase book
- List of linguistic example sentences
References
External links
- English Tatoeba homepage
- (YouTube) Video presenting the key ideas behind the Tatoeba Project
Source of article : Wikipedia