Hokkien (aka Taigi, Taiwanese Minnan) exists in multiple variants and scripts. For building translation models, we find it is best to have training data in:
* Note: Although Traditional Chinese script produces the best results for translation models, we still recommend displaying the final output to end users in Latin script by default. Many international Hokkien users cannot read Chinese script, whereas all Hokkien communities can read Latin script. Tools exist to automatically convert Hokkien written in Traditional Chinese script into a Hokkien Latin script.
Resource | Approx. Size | Description | Languages | Hokkien Scripts | Commercial Use | License Notes |
---|---|---|---|---|---|---|
Our Hokkien Data | 200 | Super Clean, High Quality, Commercializable Hokkien Data | Hokkien, English, Mandarin* | Traditional Chinese Script, Latin Script | Yes | Open-ended Commercial Use License |
Tatoeba | 15 | A database of translations from volunteer contributors. | Hokkien, English, Mandarin, Cantonese | Latin Script, Chinese Script | Yes | CC BY |
Wiki Travel: Minnan Phrasebook | 150 | Phrasebook containing basic words and phrases. Entries list variants inline, so the data requires cleaning. | Hokkien, English | Traditional Chinese Script, Latin Script | Difficult (due to SA) | CC BY-SA (ShareAlike) |
Hokkien Wikipedia | 350,000 | The Hokkien-language Wikipedia, written in POJ. The size is an estimate: roughly 350K sentences from 27M characters. | Hokkien Only | Latin Script (POJ) | Difficult (due to SA) | CC BY-SA (ShareAlike) |
Omniglot | 60 | An online encyclopedia of languages. | Hokkien, English | Traditional Chinese Script, Latin Script | No | Copyright © Omniglot |
MOE Dict | 16,000 | A Hokkien dictionary created by the Taiwanese government. | Hokkien, English, Mandarin | Traditional Chinese Script, Mix Script, Latin Script (Tailo) | No | CC BY-ND (No Derivatives) |
Taiwanese Across Taiwan Corpus (TAT) | 1,400 | TAT contains text pairs in addition to audio recordings. | Hokkien, English | Traditional Chinese Script, Mix Script, Latin Script (Tailo, POJ) | No | CC BY-NC |
Taiwanese Min Nan Media (e.g. Movies, Shows, & Songs) | unknown | Hokkien media sometimes includes Hokkien subtitles. Check that the subtitles are actually Hokkien rather than Mandarin or Standard Written Chinese. | Hokkien, English, Mandarin | Latin Script, Traditional Chinese Script | No | Rights belong to respective owners. |
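
The Hokkien Wikipedia size in the table is an estimate derived from a character count. A minimal sketch of that arithmetic follows. The average of 77 characters per sentence is our assumption, back-calculated from the 27M and 350K figures rather than measured, and `estimate_sentences` is a hypothetical helper name used only for illustration.

```python
def estimate_sentences(total_chars: int, avg_chars_per_sentence: float) -> int:
    """Return a rough sentence count for a corpus of `total_chars` characters."""
    if avg_chars_per_sentence <= 0:
        raise ValueError("avg_chars_per_sentence must be positive")
    return round(total_chars / avg_chars_per_sentence)


# Character count quoted in the table for Hokkien Wikipedia.
TOTAL_CHARS = 27_000_000

# Assumption: average POJ sentence length implied by the table's own figures.
AVG_CHARS_PER_SENTENCE = 77

estimate = estimate_sentences(TOTAL_CHARS, AVG_CHARS_PER_SENTENCE)
print(f"{estimate:,}")  # prints 350,649
```

A different average sentence length shifts the estimate proportionally, so the 350K figure is best read as an order of magnitude, not an exact count.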
Commercializability Notes:
The difficulty of finding commercializable datasets is one of the reasons we're publishing dedicated Hokkien datasets:
All our Hokkien Datasets have:
* Our dataset is tailored for ML model development, fine-tuning, and evaluation. We pre-select source sentences to cover a wide range of general-domain topics and sentence structures, including the grammatical concepts that ML models typically need additional training data to master.
Check out our Hokkien Data here