April 2025

Where to find Hokkien translation datasets

Recommended Hokkien Attributes

Hokkien (aka Taigi, Taiwanese Minnan) exists in multiple variants and scripts. For building translation models, we find it is best to have training data in:

* Note: Although Traditional Chinese script produces the best results for translation models, we still recommend displaying the final output to end users in the Latin script by default. Many international Hokkien users can’t read the Chinese script, whereas all Hokkien communities can read the Latin script. Tools exist to automatically convert Hokkien written in Traditional Chinese script into a Hokkien Latin script.

Popular Hokkien Data Sources and their Licenses

| Resource | Approx. Size | Description | Languages | Hokkien Scripts | Commercial Use | License Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Our Hokkien Data | 200 | Super clean, high-quality, commercializable Hokkien data. | Hokkien, English, Mandarin* | Traditional Chinese Script, Latin Script | Yes | Open-ended Commercial Use License |
| Tatoeba | 15 | A database of translations from volunteer contributors. | Hokkien, English, Mandarin, Cantonese | Latin Script, Chinese Script | Yes | CC BY |
| Wiki Travel: Minnan Phrasebook | 150 | Phrasebook containing basic words and phrases. Data shows inline variants and requires data cleaning. | Hokkien, English | Traditional Chinese Script, Latin Script | Difficult (due to SA) | CC BY-SA (Share Alike) |
| Hokkien Wikipedia | 350,000 | A Hokkien Wikipedia written in POJ. 350K sentences estimated from 27M characters. | Hokkien only | Latin Script (POJ) | Difficult (due to SA) | CC BY-SA (Share Alike) |
| Omniglot | 60 | An online encyclopedia of languages. | Hokkien, English | Traditional Chinese Script, Latin Script | No | Copyright © Omniglot |
| MOE Dict | 16,000 | A Hokkien dictionary created by the Taiwanese government. | Hokkien, English, Mandarin | Traditional Chinese Script, Mix Script, Latin Script (Tailo) | No | CC BY-ND (No Derivatives) |
| Taiwanese Across Taiwan Corpus (TAT) | 1,400 | TAT contains text pairs in addition to audio recordings. | Hokkien, English | Traditional Chinese Script, Mix Script, Latin Script (Tailo, POJ) | No | CC BY-NC |
| Taiwanese Min Nan Media (e.g. Movies, Shows, & Songs) | unknown | Hokkien media can sometimes contain Hokkien subtitles. Caution is required to ensure the subtitles aren't in Mandarin or Standard Written Chinese. | Hokkien, English, Mandarin | Latin Script, Traditional Chinese Script | No | Rights belong to respective owners. |
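
When shortlisting sources for a commercial project, it can help to keep the table above in machine-readable form and filter on the commercial-use column. The following is a minimal sketch, not an official tool: the `Source` class, the `usable_commercially` helper, and the `include_share_alike` flag are hypothetical names introduced for illustration, and the values are copied from the table as-is (the table does not state the unit of the size column).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Source:
    """One row of the data-source table (hypothetical structure)."""
    name: str
    approx_size: Optional[int]  # unit not stated in the table; None means "unknown"
    commercial_use: str         # "Yes", "Difficult", or "No", as listed in the table
    license_notes: str


# Values copied from the table above.
SOURCES = [
    Source("Our Hokkien Data", 200, "Yes", "Open-ended Commercial Use License"),
    Source("Tatoeba", 15, "Yes", "CC BY"),
    Source("Wiki Travel: Minnan Phrasebook", 150, "Difficult", "CC BY-SA"),
    Source("Hokkien Wikipedia", 350_000, "Difficult", "CC BY-SA"),
    Source("Omniglot", 60, "No", "Copyright © Omniglot"),
    Source("MOE Dict", 16_000, "No", "CC BY-ND"),
    Source("Taiwanese Across Taiwan Corpus (TAT)", 1_400, "No", "CC BY-NC"),
    Source("Taiwanese Min Nan Media", None, "No", "Rights belong to respective owners"),
]


def usable_commercially(sources: List[Source],
                        include_share_alike: bool = False) -> List[Source]:
    """Return sources marked "Yes" for commercial use.

    With include_share_alike=True, sources marked "Difficult" (the
    Share Alike rows) are included as well.
    """
    allowed = {"Yes"}
    if include_share_alike:
        allowed.add("Difficult")
    return [s for s in sources if s.commercial_use in allowed]


if __name__ == "__main__":
    for source in usable_commercially(SOURCES):
        print(f"{source.name}: {source.license_notes}")
```

Keeping the Share Alike rows behind an explicit flag mirrors the table's "Difficult" label: those sources are not excluded outright, but including them is a decision that should be made deliberately.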

Commercializability Notes:

Translation Data’s Hokkien Data

The difficulty of finding commercializable datasets is one of the reasons we're publishing dedicated Hokkien datasets:

All our Hokkien Datasets have:

* Our datasets are tailored for ML model development, fine-tuning, and evaluation. We pre-select source sentences to cover a wide range of general-domain topics and sentence structures, including the grammatical concepts that ML models typically need additional training data to master.

Check out our Hokkien Data here