Accepted Papers

  • Machine Translation Evaluation In SNS in Terms Of User-Centered Orientation
    Kim Euna, Department of English Language & Literature, Busan, South Korea
    This study explores the role of machine translation by creating a corpus of text from one SNS, Instagram, and analysing and evaluating the corpus data in terms of User-Centered Translation (UCT). As the data to examine, the Reuters Instagram account, with the language pair of English and Korean, was selected because its posts are open to the public and use formally structured sentences. Based on the corpus, a questionnaire was created to gauge the responses of users who follow the Reuters account and use the translation function.
  • Nazm And Computational Stylometry For Ten Arabic Travel Texts
    Ahmed Omer1 and Michael Oakes2, 1,2University of Wolverhampton, England
    Computational stylometry is the computer analysis of writing style. Successful techniques for computational stylometry characterise the texts under study by large numbers of linguistic features, such as the frequencies of word, character, or part of speech sequences contained in them. The degree of stylistic difference between a pair of documents can then be found by any of a number of measures which compare the sets of linguistic features for each document. In general, the technique is to first find a set of linguistic features and a difference measure which successfully discriminates between texts known to be either by author A or author B. Then texts of unknown authorship are compared against these texts to see whether their writing style is more similar to author A or author B. In this paper we compare character pairs, word pairs and part of speech pairs as linguistic features, and use Ward’s method [4] as the difference measure and linkage method. We use a distance matrix based on Burrows’ delta method [14] to display the results. The part of speech pairs are of particular interest for Arabic texts, as they represent the “nazm” proposed by the Arabic scholar Al-Jurjani as long ago as the 11th century CE as an indicator of individual writing style [1]. Our test bed is a corpus consisting of ten books written by Egyptian travellers between 1854 and 1930, and we show that we are able to discriminate between samples taken from these books using computational stylometry techniques. The linguistic feature which best discriminated between the texts, especially when working with small text samples, was “nazm”.
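The Burrows’ delta measure mentioned above can be sketched briefly: each text is represented by the relative frequencies of a set of feature items, the frequencies are z-scored against corpus-wide means and standard deviations, and delta is the mean absolute difference of the z-scores. A minimal sketch follows; the toy documents and most-frequent-word features are invented purely for illustration (the paper’s actual features are character, word, and part-of-speech pairs).

```python
from collections import Counter
from statistics import mean, stdev

def freq_profile(tokens, vocab):
    """Relative frequency of each chosen feature word in one text."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]

def burrows_delta(profile_x, profile_y, means, stds):
    """Mean absolute difference of z-scored feature frequencies."""
    zx = [(f - m) / s for f, m, s in zip(profile_x, means, stds)]
    zy = [(f - m) / s for f, m, s in zip(profile_y, means, stds)]
    return mean(abs(a - b) for a, b in zip(zx, zy))

# Toy corpus of three "documents" (invented for illustration only).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log the dog".split(),
    "a cat and a dog sat".split(),
]
vocab = ["the", "sat", "on"]  # most-frequent-word style features
profiles = [freq_profile(d, vocab) for d in docs]
means = [mean(col) for col in zip(*profiles)]
stds = [stdev(col) for col in zip(*profiles)]

# Smaller delta = more similar style under these features.
d01 = burrows_delta(profiles[0], profiles[1], means, stds)
d02 = burrows_delta(profiles[0], profiles[2], means, stds)
```

The pairwise deltas form the distance matrix over which a linkage method such as Ward’s can then cluster the texts.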
  • Automated Cyber Hate Detection Using Natural Language Processing And Machine Learning
    Shruti Agarwal, Prakhar Dev Gupta, Vaibhav Khandelwal and Dr. Ajay Kumar, Information Technology, ABV-IIITM, India
    With the increased usage of social media, activities such as cyberbullying and the spread of hatred have also increased. Hence it becomes essential to keep social media free from hatred and offensive remarks. A key challenge in this field is the separation of hate from merely offensive language. This work uses morphological as well as syntactic analysis to separate instances of hate from those with merely offensive language, and further classifies the content by whether it contains an element of sexism, racism, or another kind of hate or offense. Multi-class classifiers have been trained for the purpose. We create an ensemble of heterogeneous classifiers to classify text as either Hateful, Offensive but not hateful, or Neither. Further, using a second dataset, we classify whether the content is sexist, racist, or neither. The final comparison is reported in terms of accuracy, precision, and recall.
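The ensemble of heterogeneous classifiers described above can be sketched as a majority vote over independent models. The three stand-in classifiers below are deliberately trivial keyword rules invented for illustration; the paper’s actual models are trained multi-class classifiers, but the voting mechanism is the same.

```python
from collections import Counter

# Stand-in classifiers (invented rules, illustrating the interface only).
def lexicon_clf(text):
    return "hateful" if "hate" in text.lower() else "neither"

def punctuation_clf(text):
    return "offensive" if "!" in text else "neither"

def keyword_clf(text):
    t = text.lower()
    if "hate" in t:
        return "hateful"
    if "stupid" in t:
        return "offensive"
    return "neither"

def ensemble_predict(text, classifiers):
    """Majority vote over heterogeneous classifiers; on a tie, the label
    proposed first (in classifier order) wins, since Counter preserves
    insertion order among equal counts."""
    votes = Counter(clf(text) for clf in classifiers)
    return votes.most_common(1)[0][0]

clfs = [lexicon_clf, punctuation_clf, keyword_clf]
label = ensemble_predict("I hate this group", clfs)  # two of three vote "hateful"
```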
  • Wazn: Morphological Analysis for Authorship Studies in Arabic
    Ahmed Omer1 and Michael Oakes2, 1,2University of Wolverhampton, England
    In this paper we make use of a system for encoding morphologically analysed words in Arabic, where a 1 is used for an original character and a 0 for a character which occurs in a grammatical affix. These “wazn” word encodings are used as features for automatic author discrimination by hierarchical cluster analysis. It was found that the three different feature sets of wazn, arud (an encoding of the metrical system of Arabic poetry), and the frequencies of the most frequent words all gave perfect discrimination between a set of ten Egyptian travel writers when the texts were 2,000 words in length, but only the wazn-based encoding could achieve this for shorter texts of only 500 words. Encouraging results were also obtained when the wazn encoding was used for English texts in the same genre.
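The 1/0 encoding described above can be sketched for the concatenative case: given a word and its stem, each character is marked 1 if it belongs to the stem and 0 if it belongs to an affix. The English examples below are invented for illustration, and the contiguous-stem assumption is a simplification; a real Arabic analyzer must handle non-concatenative morphology.

```python
def wazn_encode(word, stem):
    """Encode each character as '1' if it belongs to the stem and '0'
    if it belongs to a grammatical affix. Assumes the stem occurs
    contiguously inside the word (a simplification for illustration)."""
    start = word.find(stem)
    if start == -1:
        raise ValueError(f"{stem!r} not found in {word!r}")
    code = ["0"] * len(word)
    for i in range(start, start + len(stem)):
        code[i] = "1"
    return "".join(code)

print(wazn_encode("walked", "walk"))     # suffixed form  -> 111100
print(wazn_encode("rewriting", "writ"))  # prefix + suffix -> 001111000
```

Sequences of such codes, rather than the words themselves, then serve as the feature stream for clustering.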
  • Hierarchical RNN for Information Extraction from Lawsuit Documents
    Rao Xia1 and Ke Zhenxing2, 1Jurtech, 2Maurer School of Law, Indiana University Bloomington, China
    Every lawsuit document contains information about the party’s claims, the court’s analysis, the decision, and other matters, all of which is helpful for understanding the case better and predicting the judge’s decision on similar cases in the future. However, extracting this information from the document is difficult because the language is complicated and sentences vary in length. We treat this problem as a sequence labeling task, and this paper presents the first research to extract relevant information from Chinese civil lawsuit documents with a hierarchical RNN framework.
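Casting extraction as sequence labeling typically means assigning each token a BIO tag. A minimal sketch of the data representation follows; the label names (CLAIM, DECISION) and the example sentence are invented for illustration, not taken from the paper’s tagset.

```python
def spans_to_bio(tokens, spans):
    """Convert labeled token spans into BIO tags for sequence labeling.
    `spans` maps (start, end) token indices (end exclusive) to a label."""
    tags = ["O"] * len(tokens)
    for (start, end), label in spans.items():
        tags[start] = f"B-{label}"          # beginning of a span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # inside the span
    return tags

tokens = "The plaintiff claims damages . The court rules for defendant .".split()
spans = {(0, 4): "CLAIM", (5, 10): "DECISION"}  # hypothetical gold spans
tags = spans_to_bio(tokens, spans)
```

A hierarchical RNN would then predict one such tag per token, with a lower-level RNN encoding words within sentences and a higher-level RNN encoding sentences within the document.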
  • Comparison of Common Part-of-Speech Tagging Techniques Applied to Waray-waray Text
    Fernando E. Quiroz, Jr.1 and Robert R. Roxas2, 1Naval State University, Philippines, 2University of the Philippines – Cebu, Gorordo
    This paper presents the results of comparing common Part-of-Speech tagging techniques applied to the Waray-waray language. The experiment involved testing a manually tagged Waray-waray corpus with four commonly used state-of-the-art Part-of-Speech tagging algorithms, namely the N-gram, TnT, Naïve Bayes, and Brill taggers. The experiment showed that the TnT tagger is the most promising for the Waray-waray language.
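The simplest member of the N-gram family, a unigram tagger, illustrates how such taggers are trained and applied: learn the most frequent tag for each word from a hand-tagged corpus, and back off to a default tag for unseen words. The tiny training corpus below is invented for illustration and is not real annotated Waray-waray data.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Learn the most frequent tag per word (the unigram baseline
    underlying N-gram taggers)."""
    freq = defaultdict(Counter)
    for sent in tagged_sents:
        for word, pos in sent:
            freq[word.lower()][pos] += 1
    return {w: c.most_common(1)[0][0] for w, c in freq.items()}

def tag(sentence, model, default="NOUN"):
    """Tag each token; unseen words back off to a default tag."""
    return [(w, model.get(w.lower(), default)) for w in sentence]

# Invented toy training data in the style of a tagged corpus.
train = [
    [("an", "DET"), ("bata", "NOUN"), ("nagkaon", "VERB")],
    [("an", "DET"), ("iro", "NOUN"), ("nagdalagan", "VERB")],
]
model = train_unigram(train)
result = tag(["an", "iro", "nagkaon"], model)
```

Higher-order N-gram and TnT taggers extend this by conditioning each tag on the preceding tags as well as the word itself.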
  • Syntactic Analysis of Compound Sentences
    Sanjeev Kumar Sharma, DAV University, India
    This research paper is an attempt to develop a syntactic analysis system for compound sentences of the Punjabi language. A sentence simplification approach has been used to reduce compound sentences into simple sentences, and these simple sentences are then analyzed for syntactic errors. A full-form lexicon-based morphological analyzer, an HMM-based POS tagger, and a set of rules have been used for the identification of grammatical mistakes. The types of grammatical errors covered in this research are agreement errors, word-order errors, and style errors. An attempt has also been made to rectify these errors.
    Ibrahim Abu El-khair, Umm Al-Qura University, Saudi Arabia
    This study is an attempt to build a contemporary linguistic corpus for the Arabic language. The corpus produced is a text corpus that includes more than five million newspaper articles. It contains over a billion and a half words in total, of which about three million are unique. The data were collected from newspaper articles in ten major news sources from eight Arabic countries, over a period of fourteen years. The corpus was produced in two encodings, namely UTF-8 and Windows CP-1256, and marked up with two mark-up languages, namely SGML and XML.
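The practical difference between the two encodings can be seen on any Arabic string: UTF-8 spends two bytes on each Arabic letter, while the single-byte Windows CP-1256 code page spends one, so the same text yields byte streams of different lengths. The sample string below is invented for illustration.

```python
text = "مرحبا"  # Arabic for "hello"; any Arabic newspaper text behaves the same

utf8_bytes = text.encode("utf-8")      # 2 bytes per Arabic letter
cp1256_bytes = text.encode("cp1256")   # 1 byte per Arabic letter

# Both byte streams decode back to the identical string.
print(len(utf8_bytes), len(cp1256_bytes))  # 10 5
```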
    Shashank Rammoorthy and Amy Rukea Stempel, Stonehill International, India
    The purpose of this study is to assess the effect of seemingly subtle differences in terminology and language on public perception, and to analyze how terminology is often an indicator of subjectivity. Twitter feeds and news articles from popular news websites were analyzed; however, these were not restricted to mainstream sources, keeping the recent “fake news” phenomenon in mind. More generally, sentiment analysis techniques were used to analyze how the differential usage of the terms ‘(Syrian) migrant’ and ‘refugee(s)’ in relation to the Syrian Civil War and the influx into Europe correlated with polarity in texts.
    Yunsil Jo, Pusan National University, South Korea
    This paper discusses the features of documentary translation for dubbing and translation strategies for this audiovisual genre. In particular, it aims to analyze differences in the use of pronouns between source text and target text by making use of a parallel corpus of English documentary scripts and their Korean translated versions. It is argued that these differences and translation strategies might be attributed to the viewers’ expectations described in Chesterman’s norm theory.