QCRI Arabic Language Technologies

Tools & Demos "FARASA"

Farasa (which means “insight” in Arabic) is a fast and accurate text processing toolkit for Arabic. It consists of a segmentation/tokenization module, a POS tagger, an Arabic text diacritizer, and a dependency parser. We measure the segmenter's accuracy and efficiency on two NLP tasks: Machine Translation (MT) and Information Retrieval (IR). Farasa matches or outperforms state-of-the-art Arabic segmenters (Stanford and MADAMIRA) while being more than an order of magnitude faster.

Farasa's segmentation/tokenization module is based on SVM-rank with a linear kernel. It uses a variety of features and lexicons to rank the possible segmentations of a word. The features include: likelihoods of stems, prefixes, suffixes, and their combinations; presence in lexicons containing valid stems or named entities; and underlying stem templates.
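The ranking idea can be illustrated with a minimal sketch. This is not Farasa's implementation: the affix likelihoods, the stem lexicon, the feature set, and the hand-set weights below are all hypothetical toy values (words are written in Buckwalter-style transliteration), and a real SVM-rank model would learn its weights from annotated data. The sketch only shows how a linear scoring function over such features can rank candidate prefix+stem+suffix splits of a word.

```python
# Toy resources, assumed for illustration only (not Farasa's data).
# Values are made-up log-likelihoods for each affix; "" means "no affix".
PREFIX_LL = {"": 0.0, "w": -1.0, "Al": -1.2, "b": -1.5}
SUFFIX_LL = {"": 0.0, "hm": -1.3, "At": -1.4}
STEM_LEXICON = {"ktAb", "qlm"}

# Hand-set weights standing in for what a linear ranker would learn.
WEIGHTS = {"prefix_ll": 1.0, "suffix_ll": 1.0, "stem_in_lexicon": 5.0}


def candidates(word):
    """Yield every (prefix, stem, suffix) split whose affixes are known."""
    for i in range(len(word)):
        for j in range(i + 1, len(word) + 1):
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix in PREFIX_LL and suffix in SUFFIX_LL:
                yield prefix, stem, suffix


def features(prefix, stem, suffix):
    """Map one candidate split to a small feature dictionary."""
    return {
        "prefix_ll": PREFIX_LL[prefix],
        "suffix_ll": SUFFIX_LL[suffix],
        "stem_in_lexicon": 1.0 if stem in STEM_LEXICON else 0.0,
    }


def score(feats):
    """Linear score: dot product of the weights and the feature values."""
    return sum(WEIGHTS[name] * value for name, value in feats.items())


def segment(word):
    """Return the highest-scoring split, with segments joined by '+'."""
    if not word:
        return word
    best = max(candidates(word), key=lambda c: score(features(*c)))
    return "+".join(part for part in best if part)


print(segment("wktAbhm"))  # "and their book" in transliteration
```

With these toy values, the split that strips a known prefix and suffix and leaves a stem found in the lexicon scores highest, because the lexicon feature outweighs the affix penalties. The features listed above that the sketch omits (affix combinations, named-entity lexicons, stem templates) would enter the same way, as extra entries in the feature dictionary.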

  • Ahmed Abdelali, Kareem Darwish, Nadir Durrani, Hamdy Mubarak. 2016. Farasa: A Fast and Furious Segmenter for Arabic. NAACL-2016.
  • Kareem Darwish and Hamdy Mubarak. 2016. Farasa: A New Fast and Accurate Arabic Word Segmenter. LREC-2016.
  • Yuan Zhang, Chengtao Li, Regina Barzilay, Kareem Darwish. 2015. Randomized Greedy Inference for Joint Segmentation, POS Tagging and Dependency Parsing. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 42-52.
  • Kareem Darwish. 2013. Named Entity Recognition using Cross-lingual Resources: Arabic as an Example. ACL-2013.
  • Kareem Darwish, Wei Gao. 2014. Simple Effective Microblog Named Entity Recognition: Arabic as an Example. LREC-2014.
  • Hamdy Mubarak, Kareem Darwish. 2014. Automatic Correction of Arabic Text: a Cascaded Approach. Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP).
  • Hamdy Mubarak, Kareem Darwish, Ahmed Abdelali. 2015. QCRI@QALB-2015 Shared Task: Correction of Arabic Text for Native and Non-Native Speakers’ Errors. Proceedings of the ACL 2015 Second Workshop on Arabic Natural Language Processing.