KEYWORD EXTRACTION USING CO-OCCURRENCE GRAPH BASED APPROACH

Veizi, Orald

URI: http://dspace.epoka.edu.al/handle/1/2410

Date: 2022-03-07

Abstract:

The rapid growth of text on the internet makes it very difficult for users to find relevant information. To address this problem, much research has been devoted to information gathering and text analytics, and keyword extraction is one of the most popular research areas within it. Data for observation and analysis comes in many forms, such as graphical data. Users may also produce data themselves through social media, Wikipedia, or other resources. Many people generate their own data on Twitter, a social media platform considered one of the most popular sources for crawling short text because each tweet is short (140 characters under the original limit). Keyword extraction is a process in which a text is given to the computer and the computer returns a set of keywords, that is, recommended topical words and phrases drawn from the content of the documents. Keyword extraction helps readers grasp the summary, or at least the core idea, of a document without reading the whole document. As a result, prospective readers do not waste valuable time reading irrelevant documents in full. In general, by searching for keywords, users can find posts related to an event. Keyword extraction methods are applied in many areas, especially information retrieval, where they are of particular interest because people retrieve significant information based on keywords. In this thesis, we use a graph-based keyword extraction algorithm on four different datasets collected from Twitter for different terms. Preprocessing the datasets with NLTK yields more optimized data, from which the co-occurrence graph is then generated. Moreover, we show that the study of co-occurrences allows keeping track of the structure of each text, although it is more tedious to handle and often leads to messy visualizations.
Many libraries are available for visualization; Python is a dependable choice for plotting because many plotting libraries are available for it. TextRank is a graph-based keyword extraction algorithm that follows Google's PageRank algorithm but differs from it in what the graph represents: the nodes are words and the links are relations between words, rather than web pages and hyperlinks. TextRank calculates a score for every relevant word, and these scores identify the more important words of the corpus; the precision of those relevant words is also computed. Word clouds are also gaining popularity as a visualization, and many word cloud styles with different looks are available on the internet. A real dataset crawled from Twitter provides the data for the experimental assessment of the proposed work.
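
To make the co-occurrence graph and the TextRank scoring described in the abstract concrete, the following is a minimal sketch in plain Python. It is not the thesis's implementation: the function names `build_cooccurrence_graph` and `textrank`, the window size of 2, the damping factor of 0.85, the fixed iteration count, and the toy token list are all assumptions chosen for illustration. The tokens are assumed to be already preprocessed (lowercased, stop words removed), for example with NLTK.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Build an undirected, weighted co-occurrence graph.

    Two distinct words are linked when they appear within `window`
    consecutive tokens; the edge weight counts how often that happens.
    `window` (assumed value) includes the current token, so window=2
    links each word to the word that directly follows it.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return graph


def textrank(graph, damping=0.85, iterations=50):
    """Score words with a PageRank-style update on the weighted graph.

    Each word receives a share of every neighbor's score, proportional
    to the edge weight divided by the neighbor's total edge weight.
    `damping` and `iterations` are assumed, commonly used values.
    """
    scores = {word: 1.0 for word in graph}
    for _ in range(iterations):
        new_scores = {}
        for word in graph:
            rank = 0.0
            for neighbor, weight in graph[word].items():
                neighbor_total = sum(graph[neighbor].values())
                rank += weight / neighbor_total * scores[neighbor]
            new_scores[word] = (1.0 - damping) + damping * rank
        scores = new_scores
    return scores


# Hypothetical, already preprocessed tokens standing in for tweet text.
tokens = "twitter data twitter graph twitter keyword twitter".split()
graph = build_cooccurrence_graph(tokens)
scores = textrank(graph)

# Print words from highest to lowest score.
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word}: {score:.3f}")
```

In this toy input, every other word co-occurs only with "twitter", so the graph is a star and "twitter" collects score from all of its neighbors. On real tweets, the highest-scoring words would be taken as the candidate keywords.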