Abstract:
One of the crucial points in the text mining studies is the feature hashing step. Most of the text mining studies starts with a text data source and processes a feature extraction methodology over the text. Most of the time the feature extraction method should be decided wisely, because, most of the times, it directly effects the results and performance. Another well-known approach is using any feature extraction method, together with the feature hashing. By the way, the feature extraction can be executed without worrying about the performance and the feature hashing reduces the size of the extracted feature vector.
Today, one of the widely used hashing algorithms in text mining is the modern hashing algorithms like MD5 or SHA1, which are built over substitution permutation networks (SPN) or Fiestel Networks. The common property of most of the modern hashing algorithms is the implicitly implemented s-boxes.
One of the drawbacks of the modern hashing algorithms is the collision free purpose of the algorithm. The permutation step in most of the time is implemented for this purpose and the correlation between the input text and output bits is completely obfuscated.
This study focuses on the possible implementations of the s-boxes for the feature hashing. The purpose feature hashing in this study is reducing the feature vector, while keeping the correlation between the input text and the output bits.