Abstract:
Finding relevant information is increasingly difficult for users because of the
rapidly growing volume of text on the internet. To address this issue, much research
has been conducted in information gathering and text analytics, and keyword extraction
is among the most popular research areas. Data for observation and analysis come in
many forms, such as graphical data. Users may also produce data themselves through
social media, Wikipedia, or other resources. Many people generate their own data on
Twitter, a social media platform considered one of the most popular for crawling short
text because each tweet is limited to 140 characters.
Keyword extraction is a process in which a text is given to the computer and the
computer returns a set of keywords: topical words and phrases drawn from the content
of the documents. Keyword extraction helps the reader grasp the summary, or at least
the core idea, of a document without reading the whole document. As a result,
prospective readers do not waste valuable time reading irrelevant documents in full.
Generally, by searching for keywords, users can find posts related to an event. Keyword
extraction methods are applied in many areas, especially information retrieval. This
is of particular interest because people retrieve significant information based on
keywords.
In this thesis, we apply a graph-based keyword extraction algorithm to four
different datasets collected from Twitter on different terms. Preprocessing the
datasets with NLTK yields cleaner data, from which the co-occurrence graph is
generated. Moreover, we examine whether studying co-occurrences allows the structure
of each text to be tracked; this approach, however, is more tedious to handle and
often leads to messy visualizations.
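
The preprocessing and graph construction described above can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the tiny stopword set stands in for NLTK's stopword corpus so that the sketch has no dependencies, and the names `preprocess`, `cooccurrence_graph`, and the `window` parameter are assumptions introduced here.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption). In practice NLTK's English
# stopword corpus would normally be used instead.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "on", "and", "for", "rt"}


def preprocess(tweet):
    """Lowercase a tweet, strip URLs and @mentions, and drop stopwords."""
    text = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]


def cooccurrence_graph(tweets, window=2):
    """Count how often two words fall inside the same sliding window.

    `window` is the number of consecutive tokens the window spans, so the
    default of 2 links each word only to its immediate successor.
    Returns a Counter mapping sorted (word_a, word_b) pairs to edge weights.
    """
    edges = Counter()
    for tweet in tweets:
        tokens = preprocess(tweet)
        for i, word in enumerate(tokens):
            for other in tokens[i + 1 : i + window]:
                if other != word:
                    edges[tuple(sorted((word, other)))] += 1
    return edges
```

Storing each edge as a sorted pair keeps the graph undirected, which matches the usual treatment of word co-occurrence; a larger `window` produces a denser graph and, as noted above, a messier visualization.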
Many libraries are available for visualization; Python is a dependable choice
for plotting because many plotting libraries are readily available for it. TextRank
is a graph-based keyword extraction algorithm. It is modeled on Google's PageRank
algorithm but differs in that it ranks words connected by co-occurrence links rather
than web pages connected by hyperlinks. TextRank calculates a score for every relevant
word, and these scores identify the most important words in the corpus; the precision
of the extracted words is also evaluated. Word clouds are also growing in popularity
as a visualization, and many word clouds in varied styles can be found on the internet.
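
To make the scoring step concrete, the following is a minimal sketch of TextRank-style ranking over a weighted, undirected co-occurrence graph. It is an illustration under assumptions rather than the implementation used in this work: the damping factor of 0.85, the fixed iteration count, and the helper names `textrank` and `precision` are choices made here, and the precision helper assumes a hypothetical set of reference keywords.

```python
def textrank(edges, damping=0.85, iterations=50):
    """Score words in an undirected graph given as {(word_a, word_b): weight}.

    Each word repeatedly receives a share of its neighbors' scores,
    proportional to the edge weight, as in weighted PageRank.
    Weights are assumed to be positive.
    """
    neighbors = {}
    for (a, b), weight in edges.items():
        neighbors.setdefault(a, {})[b] = weight
        neighbors.setdefault(b, {})[a] = weight

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for word, linked in neighbors.items():
            rank = 0.0
            for other, weight in linked.items():
                # Share of `other`'s score that flows along this edge.
                rank += weight / sum(neighbors[other].values()) * scores[other]
            new_scores[word] = (1 - damping) + damping * rank
        scores = new_scores
    return scores


def precision(extracted, reference):
    """Fraction of extracted keywords that appear in the reference set."""
    extracted = list(extracted)
    if not extracted:
        return 0.0
    return sum(1 for word in extracted if word in set(reference)) / len(extracted)
```

Sorting the returned dictionary by score in descending order and keeping the top few entries gives the candidate keywords, which can then be compared against a reference list with `precision`.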
A real dataset crawled from Twitter provides the data for the experimental
evaluation of the proposed work.