DiVoMiner® User Manual


Language identification

Model Explanation

Leveraging advanced automatic language identification technology, this system can identify over 90 languages, including Simplified and Traditional Chinese, English, Korean, Japanese, and French. It also provides detailed statistics on the distribution of languages within the input text.
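To illustrate what per-language distribution statistics can look like, the sketch below buckets characters by Unicode script range and reports their proportions. The script ranges, bucket names, and the `language_distribution` function are simplified assumptions for illustration only, not the system's actual method.

```python
from collections import Counter

def script_of(ch):
    # Crude script bucketing by Unicode code point ranges (illustrative only).
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:
        return "Chinese (CJK)"
    if 0xAC00 <= cp <= 0xD7AF:
        return "Korean (Hangul)"
    if 0x3040 <= cp <= 0x30FF:
        return "Japanese (Kana)"
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return None  # digits, punctuation, whitespace are ignored

def language_distribution(text):
    # Count characters per script bucket and normalize to proportions.
    counts = Counter(s for ch in text if (s := script_of(ch)))
    total = sum(counts.values())
    return {lang: round(n / total, 3) for lang, n in counts.items()}

print(language_distribution("Hello 世界 안녕 こんにちは"))
```

A real identifier classifies whole segments statistically rather than single characters, but the output shape (language, proportion of the input) is the same kind of distribution the manual describes.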

This algorithm is an optimized version of langID. The model employs a multinomial Naive Bayes classifier trained on corpora drawn from a variety of scenarios across multiple languages; its key advantages are high accuracy and fast execution.
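The multinomial Naive Bayes approach can be sketched as follows: each language is a class, character n-grams are the features, and a document is assigned to the language with the highest posterior probability. This is a toy illustration with made-up training sentences, not the production model, its features, or its training data.

```python
from collections import Counter
import math

def ngrams(text, n=2):
    # Character bigrams are a common feature choice for language ID.
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NaiveBayesLangID:
    def __init__(self):
        self.counts = {}        # language -> Counter of n-gram frequencies
        self.totals = {}        # language -> total n-gram count
        self.docs = Counter()   # language -> number of training documents
        self.vocab = set()      # all n-grams seen during training

    def train(self, text, lang):
        grams = ngrams(text)
        self.counts.setdefault(lang, Counter()).update(grams)
        self.totals[lang] = self.totals.get(lang, 0) + len(grams)
        self.docs[lang] += 1
        self.vocab.update(grams)

    def classify(self, text):
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best, best_score = None, float("-inf")
        for lang in self.counts:
            # Log prior plus summed log likelihoods, with add-one smoothing
            # so unseen n-grams do not zero out the probability.
            score = math.log(self.docs[lang] / total_docs)
            for g in ngrams(text):
                score += math.log(
                    (self.counts[lang][g] + 1) / (self.totals[lang] + vocab_size)
                )
            if score > best_score:
                best, best_score = lang, score
        return best

clf = NaiveBayesLangID()
clf.train("the quick brown fox jumps over the lazy dog", "en")
clf.train("hello world this is an english sentence", "en")
clf.train("le renard brun rapide saute par dessus le chien", "fr")
clf.train("bonjour le monde ceci est une phrase en francais", "fr")
print(clf.classify("the dog jumps"))   # classified as English
print(clf.classify("le chien saute"))  # classified as French
```

Because classification is just a sum of log counts per n-gram, inference is linear in the input length, which is where the "fast execution" of this family of models comes from.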

Accuracy Explanation

The model was tested on the XNLI dataset (https://github.com/facebookresearch/XNLI), a corpus built jointly by researchers from Facebook and New York University to evaluate models' multilingual sentence understanding. Recent cross-lingual models such as XLM and Multilingual BERT also use XNLI to assess their cross-lingual performance. The test sample comprises 150,000 articles totaling 9,672,723 characters, and the overall accuracy is 95.8%. Note that recognition accuracy may decline when a text mixes multiple languages.
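The overall accuracy figure is simply the fraction of test documents whose predicted language matches the gold label. A minimal sketch, with hypothetical predictions and labels:

```python
def accuracy(predictions, labels):
    # Fraction of predictions that match the gold labels.
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical example: 3 of 4 predictions are correct.
print(accuracy(["en", "fr", "zh", "en"], ["en", "fr", "zh", "fr"]))  # 0.75
```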
