I found this website through wwwjdic: http://tatoeba.fr/?documentation&id=1&lang=en It's a project aiming to improve and expand upon the Tanaka corpus. I thought that it might be of interest for some people on this forum. While the Tanaka corpus might be flawed, it's now possible to correct those flaws. And people that collect sentence pairs for their own use could submit them in order to benefit everybody.
Last edited by Transparent_Aluminium (2008 November 16, 12:22 pm)
I doubt it would be improved; this is how the Tanaka Corpus worked from the start: students simply picked sentence pairs. Nothing guaranteed that the sentence pairs made sense or sounded natural at all. What we need is simply native Japanese people writing native Japanese phrases, then good translators translating them in a literal sense with footnotes.
It's a very interesting site, but there's always the question of how literal vs. how fluent you want the phrases to sound.
For example, I randomly got this set of sentences:
wwwjdic Discussion Edit 10295 (>>) Logs
en I have gas in my abdomen.
fr J'ai des gaz dans mon abdomen.
jp お腹にガスがたまっています。
In American English, you would normally drop the "in my abdomen" bit, and just know by context (a little hard to do in an Internet posting...) - I don't know about the French and Japanese.
At least with the Japanese, they didn't include 私は! But I wonder if:
ガスがたまっています。
would be sufficient in Japanese to get the meaning across.
The current distribution of sentences is quite good for learners of Japanese:
Number of sentences
English: 151799
Japanese: 151357
French: 23498
German: 1615
Spanish: 1124
...and you can download the sentences.
http://tatoeba.fr/?download-sentences
Last edited by kfmfe04 (2008 November 16, 3:54 pm)
I downloaded that file. I have no idea what to do with these.
Are you supposed to cross-reference the numbers? Because that would take an eternity.
Also, some of the English made my head spin, and, you should understand, that is quite an achievement on the translators' part. @_@
kazelee wrote:
I downloaded that file. I have no idea what to do with these.
Are you supposed to cross-reference the numbers? Because that would take an eternity.
Also, some of the English made my head spin, and, you should understand, that is quite an achievement on the translators' part. @_@
Yes, you need to cross-reference. I have a relational database which can do this for us. I can post a file containing just Japanese and English later on, if I can just get rid of all the 文字化け in the file.
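For anyone without a database handy, the cross-referencing can be sketched in a few lines of Python. This is only a rough illustration: the layout of the download is an assumption here (one tab-separated row per sentence, `id<TAB>lang<TAB>text`, plus a separate list of linked id pairs), and the names `load_sentences`, `pair_up`, `SAMPLE`, and `LINKS` are made up for the example. Check the real file before relying on any of it.

```python
def load_sentences(tsv_text):
    """Parse rows of 'id<TAB>lang<TAB>text' into {id: (lang, text)}.

    The three-column layout is an assumption, not the documented format.
    """
    sentences = {}
    for line in tsv_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        sentence_id, lang, text = parts
        sentences[int(sentence_id)] = (lang, text)
    return sentences


def pair_up(sentences, links, source_lang="jp", target_lang="en"):
    """Return (source text, target text) for every link that goes
    from a source_lang sentence to a target_lang sentence."""
    pairs = []
    for source_id, target_id in links:
        source = sentences.get(source_id)
        target = sentences.get(target_id)
        if source is None or target is None:
            continue  # link points at a sentence we do not have
        if source[0] == source_lang and target[0] == target_lang:
            pairs.append((source[1], target[1]))
    return pairs


# Hypothetical sample data in the assumed layout.
SAMPLE = "\n".join([
    "1\tjp\tお腹にガスがたまっています。",
    "2\ten\tI have gas in my abdomen.",
    "3\tfr\tJ'ai des gaz dans mon abdomen.",
])
LINKS = [(1, 2), (1, 3), (2, 1)]

pairs = pair_up(load_sentences(SAMPLE), LINKS)
for jp, en in pairs:
    print(jp, "|", en)
```

Keeping the sentences in a dictionary keyed by id means each link is resolved with two lookups, so the numbers never have to be matched up by hand.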
Saving it to UNICODE txt in TextEdit doesn't quite do the trick in Vista. If someone manages to get rid of the 文字化け on Mac, XP, Vista, or Linux, plz let me know what you did. Thx.
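One possible way to attack the 文字化け, sketched in Python and assuming the problem is simply that the file is being read with the wrong encoding: decode the raw bytes with a few candidate encodings, see which ones work, then rewrite the text as UTF-8. The candidate list and the helper names `decodable_as` and `to_utf8` are assumptions for illustration; the actual encoding of the download is not known here.

```python
def decodable_as(raw, candidates=("utf-8", "shift_jis", "euc_jp")):
    """Return the candidate encodings that decode raw without an error.

    A clean decode does not prove the encoding is right (EUC-JP bytes,
    for instance, can decode as Shift_JIS and still be garbage), so the
    result should be checked by eye.
    """
    usable = []
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        usable.append(encoding)
    return usable


def to_utf8(raw, source_encoding):
    """Re-encode bytes from source_encoding to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Simulate a file that was saved as Shift_JIS.
text = "お腹にガスがたまっています。"
raw = text.encode("shift_jis")

print(decodable_as(raw))
fixed = to_utf8(raw, "shift_jis")
print(fixed.decode("utf-8"))
```

For a real file, read it in binary mode (`open(path, "rb").read()`), pick the encoding whose output actually looks like Japanese, and write the result of `to_utf8` back out in binary mode.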
Last edited by kfmfe04 (2008 November 16, 4:20 pm)
kfmfe04 wrote:
It's a very interesting site, but there's always the question of how literal vs. how fluent you want the phrases to sound.
For example, I randomly got this set of sentences:
wwwjdic Discussion Edit 10295 (>>) Logs
en I have gas in my abdomen.
fr J'ai des gaz dans mon abdomen.
jp お腹にガスがたまっています。
In American English, you would normally drop the "in my abdomen" bit, and just know by context (a little hard to do in an Internet posting...) - I don't know about the French and Japanese.
At least with the Japanese, they didn't include 私は! But I wonder if:
ガスがたまっています。
would be sufficient in Japanese to get the meaning across.
The expression according to 研究社 is "have gas: 腹が張る" or alternatively just "おなら". I think the expression above is just a literalism.
-edit-
Just found it in 研究社, but only in the 和英.
" ?腹にガスがたまる have gas in the bowels"
The usage is pretty medical sounding and it even has "(おなら)" at the end to explain what ガス means in this context to the Japanese reader. The meaning for ガス is overwhelmingly of the combustible or poisonous variety (no stinky fart or lighting your fart jokes please). 腹が張る or おなら are much more natural.
Google fight!
Results 1 - 10 of about 103,000 for "腹が張る"
Results 1 - 10 of about 1,690,000 for "おなら"
Results 1 - 10 of about 815 for "腹にガスがたまる"
Last edited by Jarvik7 (2008 November 16, 4:54 pm)
kfmfe04 wrote:
en I have gas in my abdomen.
fr J'ai des gaz dans mon abdomen.
jp お腹にガスがたまっています。
This is exactly why projects like these are useless. I seriously doubt a Japanese person would say お腹にガスがたまっています. It's just a sentence from one language directly translated to other languages... I don't think I've ever heard an English person say "I have gas in my abdomen." It's just not a normal way of expressing yourself.
Tobberoth wrote:
kfmfe04 wrote:
en I have gas in my abdomen.
fr J'ai des gaz dans mon abdomen.
jp お腹にガスがたまっています。
This is exactly why projects like these are useless. I seriously doubt a Japanese person would say お腹にガスがたまっています. It's just a sentence from one language directly translated to other languages... I don't think I've ever heard an English person say "I have gas in my abdomen." It's just not a normal way of expressing yourself.
I'm beginning to agree with you on this.
Perhaps if they had one source language, and other languages as explanations or loose translations of the source language, the quality would be better. For fluency purposes, a loose but fluent translation would actually be more useful than a more literal translation.
Thanks for pointing this out.
Last edited by kfmfe04 (2008 November 16, 4:49 pm)
@ TRoth
Useless is such a strong word. I'm sure there is some use for it.
Last edited by kazelee (2008 November 16, 4:50 pm)
Well, it'll always be good for a laugh. So there's that.
kfmfe04 wrote:
Perhaps if they had one source language, and other languages as explanations or loose translations of the source language, the quality would be better. For fluency purposes, a loose but fluent translation would actually be more useful than a more literal translation.
If the goal was for this to be a valid learning resource, each sentence would need both literal AND natural translations, at a minimum.
Here's a description of the Tanaka Corpus:
http://www.csse.monash.edu.au/~jwb/tanakacorpus.html
The Tatoeba project's intention is good, but the execution is broken. At the very least, there should be:
1. A source language
2. Translations
A sentence for a source language must come from a native source. Translations should be literal; if it's translated at some expense to fluidity, we should forgive, or just post notes.
I was thinking it might be nice for us to specify a "properly useful" project and build a site ourselves, but I wonder if it's really worth the trouble...
1. The gas example I posted may be somewhat of a fluke (not so common?)
2. Whatever Japanese sentence is there, it is probably better than anything I can produce
3. It's there already, so no additional work is required
or, as QuackingShoe commented,
4. It's just good for a laugh
Maybe, in a few years, there will be a de-facto, standard Wikipedia-like aggregation of sentences... ...maybe Tatoeba is it, or it could be a precursor to something.
What are your thoughts?
Last edited by kfmfe04 (2008 November 16, 5:59 pm)
I was thinking it might be nice for us to specify a "properly useful" project and build a site ourselves, but I wonder if it's really worth the trouble...
Cooperation is good and all, but I think a number of people on this site spend far too much time making/compiling study material and not enough time actually studying. I don't really see the use for this kind of sentence pairing database even if it was reliable though. Then again I don't believe in the sentence method
A sentence for a source language must come from a native source. Translations should be literal; if it's translated at some expense to fluidity, we should forgive, or just post notes.
I think they should be natural if you only had one translation. At least that way you would learn how to express concepts in the target language. Having literal translations would gain you nothing other than knowing what a properly formatted (but nonsense) sentence looks like.
Example:
Source: 腹が立つ。
Literal translation: (My) stomach is standing. (properly formatted but unintelligible)
Natural translation: I'm angry.
-opposite direction-
Source: He kicked Bob's ass.
Literal translation: 彼はボッブの尻を蹴った。 (properly formatted and intelligible, but doesn't express the same meaning)
Natural translation: 彼はボッブを遣っ付けた。
Last edited by Jarvik7 (2008 November 16, 6:15 pm)
So what method do you believe in?
I don't follow a method. Normally I just add words to Anki as I encounter them in life either from conversation (few hours daily), media, or my translation studies. I've actually been getting some less than useful words lately thanks to history (eg.郷士) & art studies (eg.引目鉤鼻). My cards are usually just the word on one side and a Japanese or English definition (copy & pasted from a dictionary) on the other - minimal time investment in the materials.
However right now I'm cramming for JLPT so I'm using prep. books & a vocab list. It's not the best way to learn Japanese, but the best way to pass a standardized test is to study for the test. JLPT never was a realistic test of ability anyways.
I say I don't believe in the sentence mining method because I think it's too much of a time investment in the mining as opposed to the learning. Gathering materials vs using them... A lot of the threads have been about mining questionable sources like phrase-books and beginner level textbooks too.
Last edited by Jarvik7 (2008 November 16, 7:36 pm)
So you are in Japan?
I am for about half the time. At the moment I'm in Canada, but I spend most of my time speaking & thinking in Japanese. It's not intentionally AJATT but it just works out that way because of what I study (Japanese translation, literature, and art) and who my friends are.
That leaves something to be envied.
Jarvik7 wrote:
... My cards are usually just the word on one side and a Japanese or English definition (copy & pasted from a dictionary) on the other - minimal time investment in the materials.
I say I don't believe in the sentence mining method because I think it's too much of a time investment in the mining as opposed to the learning. Gathering materials vs using them... A lot of the threads have been about mining questionable sources like phrase-books and beginner level textbooks too.
My cards are usually just a sentence on one side and a Japanese definition of the word(s) I don't know (copy & pasted from a dictionary) on the other - minimal time investment in the materials.
I never set aside time solely to 'mine' sentences, I just pick stuff out when I'm reading/watching Japanese materials, which takes me all of 30-45 seconds to type it and grab a definition.
Of course my earlier sentences took more time to input, but the method can be streamlined to being very little more effort than what you're doing, with a lot more benefits (usage patterns, grammar, etc.).
(I've recently joined the Tatoeba project as a developer.)
It's true that some English sentences sound a bit weird, but one must keep in mind that most of these English sentences come from the Tanaka corpus, and everyone knows it had some mistakes, unnatural expressions, and so on.
But Tatoeba is a collaborative project, and one of its goals is to let anyone correct these sentences (as you can see, Jim Breen, the Tanaka corpus initiator, now uses Tatoeba to correct sentences), propose better translations, or add translations in their own language.
So I agree that "as is" the site is not a must-have website, but hey, it's collaborative. If you report mistakes (I've already corrected the sentence with gas, and any registered user can do so), contribute, or just use it and propose your wishlist, we're sure it will become better in quantity and quality.
Moreover, those who visited the website some months ago can see that we have totally changed the design, and we're in an active development phase, so any suggestions are welcome.
Last edited by Sysko (2009 December 22, 5:58 am)
I'm glad to see that this is still alive. I wonder how you interact with the Example amendment form on wwwjdic. I hope that there's some kind of coordination.
Yes, don't worry, we have been coordinating this with Jim Breen.
He agreed to use Tatoeba as the new platform to take care of the sentences on WWWJDIC. He will update his pages to link to Tatoeba in case someone wants to point out a mistake or suggest a correction.
On our side, of course, we will provide him with the latest data from Tatoeba on a regular basis so he can update his data.
@Sysko: Jim Breen didn't initiate the Tanaka corpus. Tanaka did. That's why it's not called the Breen corpus.
Hmm yep sorry for this mistake :shame:
Jarvik7 wrote:
I don't follow a method. Normally I just add words to Anki as I encounter them in life either from conversation (few hours daily), media, or my translation studies. I've actually been getting some less than useful words lately thanks to history (eg.郷士) & art studies (eg.引目鉤鼻). My cards are usually just the word on one side and a Japanese or English definition (copy & pasted from a dictionary) on the other - minimal time investment in the materials.
However right now I'm cramming for JLPT so I'm using prep. books & a vocab list. It's not the best way to learn Japanese, but the best way to pass a standardized test is to study for the test. JLPT never was a realistic test of ability anyways.
I say I don't believe in the sentence mining method because I think it's too much of a time investment in the mining as opposed to the learning. Gathering materials vs using them... A lot of the threads have been about mining questionable sources like phrase-books and beginner level textbooks too.
In fact, that's one of the reasons the Tatoeba project was created: we spend too much time writing sentences that have probably been written thousands of times by learners all around the world. That said, I agree that working only with sentences is not a good method; it must be combined with other things.
So Tatoeba lets us distribute the workload; moreover, most people add sentences in their own native language. And as the project is collaborative, people can discuss questionable sentences and try to find out how a native speaker would say them. So I think, and I hope, that when the project has a mature community, the main activity will be browsing the sentences and discussing grammar points between natives and learners (that's already the case for some languages) rather than adding sentences, but everything must begin somewhere.
Later, we could easily make an Anki plugin that retrieves example sentences from Tatoeba when you add a new card for a word, and synchronises your deck with your list on Tatoeba. That way it would be great both for the community, who will have new sentences to browse and reuse, and for learners, who will not have to duplicate content and can get feedback on the sentences they've added in Anki.
Moreover, another goal of Tatoeba is to keep a record of "minor languages". For some languages, such as Shanghainese or Malay, it's hard to find resources, even simple example sentences, so I think having these languages present in Tatoeba, with translations available in dozens of languages, is something that can really help these languages be learnt, recognized, and preserved.
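To make the lookup idea concrete, here is a minimal sketch in Python of the "retrieve example sentences for a word" step. It is not an Anki plugin and does not talk to Tatoeba: it only searches a local list of (Japanese, English) pairs, and the function name `find_examples`, the `limit` parameter, and the sample pairs (taken from earlier in this thread) are all assumptions for illustration.

```python
def find_examples(word, pairs, limit=3):
    """Return up to `limit` (japanese, english) pairs whose Japanese
    side contains `word`, shortest sentences first.

    This is plain substring matching, not real word segmentation, so a
    short word can also match inside a longer, unrelated word.
    """
    matches = [(jp, en) for jp, en in pairs if word in jp]
    matches.sort(key=lambda pair: len(pair[0]))
    return matches[:limit]


# Sample pairs quoted from earlier posts in this thread.
PAIRS = [
    ("お腹にガスがたまっています。", "I have gas in my abdomen."),
    ("腹が立つ。", "I'm angry."),
    ("彼はボッブを遣っ付けた。", "He kicked Bob's ass."),
]

for jp, en in find_examples("腹", PAIRS):
    print(jp, "|", en)
```

Sorting by length is just one possible choice, on the guess that shorter sentences make easier cards; a real plugin would presumably want a better ranking.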
Last edited by Sysko (2009 December 23, 6:58 am)
I like the idea of the project. I think a major stepping stone would be to assemble proofread lists.

