Hokkien (also known as Minnan, Taiwanese Hokkien, Taigi) is used by around 50 million speakers worldwide, more than Polish. It is relied upon throughout much of Southeast Asia; such as Singapore, Taiwan, Malaysia, China, and the Philippines. It is the most popular Min language. Like most Chinese languages, Hokkien is not mutually intelligible with other Chinese languages.
Hokkien is arguably the most important Chinese language for global companies after Mandarin Chinese and Cantonese Chinese. This is due to Hokkien’s international reach and speakers’ reliance on the language compared to other Chinese languages like Wu (Shanghainese). Because of its international reach, Hokkien is often found written in the Latin script, as not all Hokkien speakers understand the Chinese script.
Using English-Hokkien Translation as an example, the following Hokkien translation pipeline produces the best results:
Most variants of Hokkien are mutually intelligible, but differences in grammar and vocabulary can occur by region. Whilst Taiwanese Hokkien is the most stable and recognized variant of Hokkien, Hokkien variants are also found outside of Taiwan. This includes Malaysia, Singapore, the Philippines, Hong Kong, Indonesia, Cambodia, China, Myanmar, Thailand, Brunei, and Vietnam.
There are 3 main ways of writing Hokkien:
A standardized Traditional Chinese Script character set for Hokkien was released by Taiwan in 2007. Outside of Taiwan, Hokkien written in the Traditional Chinese Script remains non-standardized and can vary by community.
There are two types of Hokkien Latin Scripts: Tâi-lô and Pe̍h-ōe-jī (POJ). POJ is older but has more historic resources. Tailo is newer and officially adopted in Taiwan. Tailo is found in standardized uses of Hokkien such as in education and government. Both scripts are mutually intelligible, and can be converted between each other.
From a technical point of view, using Traditional Chinese Script as an intermediary helps produce better Hokkien translations. However, because a significant amount of the international Hokkien diaspora is not familiar with Chinese scripts, Latin Script is usually preferred when reading Hokkien. As such, to reach the most end-users, we recommend ensuring that the Latin Script is available. Translators can also choose to offer multiple scripts, for example; Microsoft offers Arabic in both Arabic and Latin/Roman scripts, whilst Google offers Crimean Tatar in both Latin and Cyrillic scripts).
As of this writing, there are several ways to convert Hokkien text to speech (i.e. Hokkien TTS). Each of these come with their own set of considerations:
Hokkien Text-to-Speech | Rank* | Considerations |
---|---|---|
ITUAN (Taigi) | 1 |
|
Facebook MMS (Chinese, Min Nan) | 2 |
|
Microsoft (Southern Min) | 3 |
|
*Ranked by closeness to MoeDict Hokkien Human Recordings using the TTS tool’s optimal format (whether chinese script, mixed script, or latin script). 1=Best
Training a language translation model boils down to two steps:
Most Hokkien datasets generally include English or Mandarin (in Traditional Chinese Script), as well as Hokkien in different written scripts (Traditional Chinese Script, Mixed Script, POJ Latin Script, Tailo Latin Script).
Click here for our list of [Hokkien Translation Datasets].
⚠️ Note: Most Hokkien Translation data is restricted to non-commercial use. If you are building a commercial translator check out our high quality Hokkien dataset that’s licensed for commercial use.
In 2025, one of the easiest ways to train a language translation model is to fine tune a large language model (LLM). OpenAI makes finetuning their base models very easy (see Finetuning OpenAI Models ). Meanwhile, with a bit more technical skill, one you finetune other models like LLAMA or do additional finetuning on existing Hokkien translation models (⚠️ Note: Existing Hokkien Translation Models are generally not trained on commercializable data and can not be used for commercial purposes).
We’ve presented various considerations when building a Hokkien translator. This ranges from the Written Scripts in the translation pipeline, to Hokkien variants and Hokkien Text-to-Speech. You should also consider if the translation model will be for non-commercial or commercial. If the translator is for commercial purposes, then clean high-quality commercial-use data is a necessity.
We look forward to seeing your Hokkien Translator out in the world soon!