Hokkien

The Hokkien Dataset

$ 79.99


Features

  • Clean Data
  • Quality Human Translations
  • Diverse texts optimized for quality Machine Learning
  • Licensed for Commercial Use
  • Immediate Availability
  • Ethical Sourcing
  • Legally Safe Sourcing

Overview: A text translation dataset with Hokkien, English, and Mandarin.
Size: 202 sentences

Columns

  • English: ★ Source Text optimized for Machine Translation with sentences specifically written to cover diverse general domain concepts (embeddings), and exhibit diverse grammatical features.
  • Hokkien (Traditional Chinese Script): ★ High Quality Human Translations, using the prestigious Taiwanese Hokkien variant of Hokkien.
  • Hokkien (POJ Latin Script): ★ High Quality Human Translations in Pe̍h-ōe-jī. Written in the spoken register (i.e. Text-to-Speech compatible).
  • Mandarin (Traditional Chinese Script): Google Neural Machine Translation from English texts.
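
As a rough illustration of how the columns above might be consumed, the sketch below parses a small in-memory CSV with Python's standard csv module and builds English-to-Hokkien training pairs. The delivery format and the header names (id, english, hokkien_hant, hokkien_poj, mandarin_hant) are assumptions made for this example and are not confirmed by this page; the single data row is copied from the Sample section.

```python
import csv
import io

# Hypothetical header names; the real file may use other names or another format.
EXPECTED_COLUMNS = ["id", "english", "hokkien_hant", "hokkien_poj", "mandarin_hant"]

# One row copied from the Sample section, laid out as the assumed CSV.
SAMPLE_CSV = (
    "id,english,hokkien_hant,hokkien_poj,mandarin_hant\n"
    "D150,My grandmother's stories are filled with wisdom and humor.,"
    "我阿嬤的故事充滿智慧和幽默。,"
    "Góa ah-má ê sū-koa uì tshù-hok kap siāu-phìng.,"
    "我祖母的故事充滿智慧和幽默。\n"
)


def load_rows(text):
    """Parse CSV text and check that every assumed column is present."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


rows = load_rows(SAMPLE_CSV)

# Build (source, target) pairs for an English -> Hokkien translation model.
pairs = [(r["english"], r["hokkien_hant"]) for r in rows]
```

Checking the header up front makes a mismatch between the assumed and the actual column names fail loudly instead of surfacing later as a KeyError.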

License

This dataset is available for creating commercial and non-commercial applications, such as Machine Translation models and educational apps, without requiring attribution. Buyers and external parties are also permitted to publish derivative datasets that contain IDs from the "id" column; no other original data may be published in derivative datasets. This dataset cannot be resold or published as is.
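
One way to read the derivative-dataset rule is sketched below: a derived record keeps only the original "id" and adds fields the publisher created. This is an illustration of one interpretation, not a statement of what the license requires. The function name make_derivative, the column names other than "id", and the "topic" label are all hypothetical.

```python
# Hypothetical names for the original text columns; only "id" is named in the license.
ORIGINAL_TEXT_COLUMNS = {"english", "hokkien_hant", "hokkien_poj", "mandarin_hant"}


def make_derivative(rows, new_fields):
    """Return records holding only the original id plus newly created fields."""
    derivative = []
    for row in rows:
        extra = new_fields.get(row["id"], {})
        # Refuse to copy any original text column into the derived record.
        overlap = ORIGINAL_TEXT_COLUMNS.intersection(extra)
        if overlap:
            raise ValueError(f"original columns must not be republished: {sorted(overlap)}")
        record = {"id": row["id"]}
        record.update(extra)
        derivative.append(record)
    return derivative


# A purchased row (text from the Sample section) and a label created by the buyer.
purchased = [
    {"id": "D150", "english": "My grandmother's stories are filled with wisdom and humor."}
]
my_labels = {"D150": {"topic": "family"}}

derived = make_derivative(purchased, my_labels)
```

Here derived holds the id and the new label only, so the original sentences stay out of the published file.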


Sample

id: D50
English: The intricate puzzle kept him entertained for hours on end.
Hokkien (Traditional Chinese): 彼个精細的拼圖佮伊耗幾若點鐘的時間。
Hokkien (POJ Latin): Hiān-khak ê bí-á hō͘ i pīn-chò lâu-lâu bô-khùn.
Mandarin (Traditional Chinese): 這個複雜的謎題讓他連續幾個小時都樂此不疲。

id: D100
English: Is it ethical for companies to use personal data to influence consumer behavior?
Hokkien (Traditional Chinese): 公司使用個人資料來影響消費者行為,這樣做合乎倫理嗎?
Hokkien (POJ Latin): Kompaniyānnā vyaktigat mahiti vaprun grahakānchī vārtanuk prabhavit karṇe nāitik āhe kā?
Mandarin (Traditional Chinese): 公司利用個人資料來影響消費者行為是否合乎道德?

id: D150
English: My grandmother's stories are filled with wisdom and humor.
Hokkien (Traditional Chinese): 我阿嬤的故事充滿智慧和幽默。
Hokkien (POJ Latin): Góa ah-má ê sū-koa uì tshù-hok kap siāu-phìng.
Mandarin (Traditional Chinese): 我祖母的故事充滿智慧和幽默。
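
POJ relies on combining diacritics (for example the dot above right in hō͘ from row D50), so the same syllable can be stored as different code point sequences. The following is a minimal sketch, assuming the POJ column is handled as Unicode text in Python, of normalizing to NFC before comparing or deduplicating strings. The helper name normalize_poj is hypothetical, and this page does not state whether the dataset itself ships in NFC.

```python
import unicodedata


def normalize_poj(text):
    """Return POJ text in Unicode NFC so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text)


# The syllable "hō͘" written fully decomposed: o + combining macron + combining dot above right.
decomposed = "ho\u0304\u0358"
# The same syllable with a precomposed ō (U+014D) followed by the combining dot.
precomposed = "h\u014d\u0358"

# The raw strings differ code point by code point, although they render identically.
same_before = decomposed == precomposed

# After normalization both spellings map to one canonical sequence.
same_after = normalize_poj(decomposed) == normalize_poj(precomposed)
```

Applying the same normalization to model inputs and to reference translations avoids spurious mismatches during evaluation.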

About the Hokkien Language

Names*: Hokkien, Min Nan, Southern Min, Taiwanese, Taigi, Banlam, Quanzhang, 閩南語, 咱儂話, 福建話, 臺灣話, 書語

(* Though Hokkien is technically part of the Southern Min (aka Min Nan) group of languages, the Hokkien language is also sometimes referred to as "Southern Min". This is because Hokkien is the most widely spoken Southern Min language.)

Population: 40-50 million
Regions: Primarily East and Southeast Asia (Singapore, Taiwan, Malaysia, Philippines, Indonesia, Cambodia, Myanmar, Hong Kong, Thailand, Brunei, Vietnam, China)
BCP 47 IETF Language Code: zh-hkm
Glottolog Code: hokk1242
ISO 639-3: nan
Highest Resource Variant: Taiwanese Hokkien
Written Scripts: Traditional Chinese Script, Simplified Chinese Script, Mixed Script (Hanlo), Latin Script (POJ or Tailo)
Recommended Translator Script: Traditional Chinese Script (See reasoning)
Recommended User-Facing Script: Latin Script (POJ or Tailo) (See reasoning)