Abstract:
Digital life, especially since the introduction of Web 2.0, has significantly altered
human relations, giving everyone the “right of public speech”. Ideas, emotions,
and opinions on many topics are freely shared in virtual environments. A new-age,
global, and digital “Mouth of the World” is shaping a society in which knowledge is the
most influential power. Because this society is fed by social media data that is highly
dynamic in both amount and shape, automatic handling of such data is indispensable.
Natural Language Processing, in cooperation with Machine Learning techniques, plays
an important role in analyzing written textual data. Traditional techniques used in
the literature become more powerful when combined into hybrid ones that also take into
account the characteristic properties of the language and the domain-specific data.
Although every step of the text classification chain is important, adequate feature
selection has a notably large impact on classification accuracy.
In this study, a simple document-level classification of the sentiment polarity of
subjective texts in Turkish is performed. The domains include customer reviews of
company products, movies, and healthcare services, where the task is to decide whether
a comment is positive or negative. Another domain consists of doctors’ notes on
patients’ symptoms, where the aim is to predict, and thus recommend, some of the most
frequently used medical tests according to general doctors’ procedures.
The features used included some or all of the distinct word roots together with their
binary or frequency information. Linear or vector analysis of the feature sets was
carried out using Machine Learning algorithms provided by the Weka tool. A hybrid
feature set was proposed and found to be more efficient. It combines binary vectors with
frequency meta-features taken from the nodes and leaves of a J48 tree classifier, using
either all features or a subset chosen by correlation-based selection, and it improved
both prediction accuracy and classification performance.
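
As a rough illustration of the two basic representations mentioned above, the following minimal Python sketch builds binary and frequency vectors over distinct word roots. It is not the study's implementation, which relied on Weka: the function names and the toy comments are hypothetical, the comments are assumed to be already reduced to word roots, and the J48-derived meta-features and the correlation-based selection step are not reproduced.

```python
from collections import Counter


def build_vocabulary(documents):
    """Return the distinct word roots across all documents, sorted alphabetically."""
    return sorted({root for doc in documents for root in doc})


def binary_vector(doc, vocabulary):
    """1 if the word root occurs in the document, 0 otherwise."""
    present = set(doc)
    return [1 if root in present else 0 for root in vocabulary]


def frequency_vector(doc, vocabulary):
    """Number of times each word root occurs in the document."""
    counts = Counter(doc)
    return [counts[root] for root in vocabulary]


# Hypothetical toy comments, assumed to be already stemmed to word roots.
documents = [
    ["film", "iyi", "iyi"],   # "film good good"
    ["film", "berbat"],       # "film terrible"
]

vocabulary = build_vocabulary(documents)
binary = [binary_vector(doc, vocabulary) for doc in documents]
frequency = [frequency_vector(doc, vocabulary) for doc in documents]
```

The two representations differ only where a word root repeats within a comment, as "iyi" does in the first toy comment; a hybrid feature set in the sense of the abstract would augment the binary vectors with additional frequency-based meta-features.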