TEXT ANALYTICS METHODOLOGIES AND FRAMEWORKS

Ayush Srivastav
Jun 16, 2021

1. INTRODUCTION

Text Analysis refers to methods developed to extract information from textual data in a form that machines can readily process. These methods are designed to generate structured data from unstructured text. There are multiple methods for textual analysis, such as Frequency Analysis, Latent Semantic Analysis, Topic Modelling, Word Embeddings, Neural Networks, etc.

2. WORD FREQUENCY ANALYSIS

Frequency Analysis measures the number of times a word occurs in a given set of documents. Multiple mathematical models are available for such analysis, such as counting raw frequencies, the bag-of-words model, the term frequency-inverse document frequency (tf-idf) model and the co-occurrence matrix (Srivastav et al., 2020).
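As an illustration, the sketch below builds raw-count (bag-of-words) and tf-idf representations for a few toy sentences. The choice of scikit-learn and the example documents are assumptions made here for demonstration and are not taken from the cited work.

```python
# Minimal sketch: bag-of-words counts and tf-idf weights with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make good pets",
]

# Bag-of-words: raw term counts per document.
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(counts.toarray())

# tf-idf: down-weights terms that occur in many documents.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(weights.toarray().round(2))
```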

Such techniques are useful in many respects. They may be used to establish the signature of a particular author in literature. Combined with a synonym dictionary, they can power thesaurus-style synonym software. They are heavily used in plagiarism detection software. They are also widely used on the World Wide Web, where search engines apply them to establish the subject of web pages (Tello, n.d.).

These techniques are suited to lightweight textual analysis and to forming a rough idea of the contents of a document. For an in-depth analysis, however, more sophisticated techniques are needed.

3. WORD ASSOCIATION ANALYSIS / WORD EMBEDDINGS

Word Embeddings refer to a method of modelling in which words with similar meanings are grouped together and represented by similar vectors. Because words that appear in similar contexts receive similar representations, the associations between them become particularly clear to the learning algorithms behind complex Natural Language Processing tasks (Brownlee, 2019).

These models are helpful in almost every Machine Learning task involving text and in Natural Language Processing generally. Large corpora are converted into word embedding models for further analysis. They may be used for analysing survey reports, analysing verbatim comments, or building recommendation systems for music, books or videos that rely on textual data (Gupta, 2019).

Some of the most common Word Embedding models are:

I. Word2Vec - It is a semantic-learning framework built on a neural network comprising two layers. It makes use of the Continuous Bag-of-Words and Skip-gram models (a minimal training sketch follows this list) and is particularly used to obtain actionable information from customer reviews (Gupta, 2019).

II. GloVe - Global Vectors is an unsupervised learning algorithm that combines local context windows with global matrix factorisation. It was developed at Stanford University (Srivastav et al., 2020).

III. fastText - It is an algorithm developed by Facebook's Artificial Intelligence Research team that combines some of the most effective ideas in Natural Language Processing. It is particularly helpful in text classification and sentiment analysis (Srivastav et al., 2020).
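The sketch below trains a tiny Word2Vec model with the gensim library to show the two training modes mentioned above; the toy corpus and hyperparameters are purely illustrative assumptions.

```python
# Minimal Word2Vec sketch with gensim; sg=1 selects Skip-gram, sg=0 selects CBOW.
from gensim.models import Word2Vec

sentences = [
    ["customer", "loved", "the", "fast", "delivery"],
    ["delivery", "was", "slow", "and", "the", "customer", "complained"],
    ["great", "product", "and", "friendly", "customer", "service"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=100)

# Every word is now a dense vector; words used in similar contexts end up close together.
print(model.wv["delivery"][:5])
print(model.wv.most_similar("customer", topn=3))
```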

Although Word Embedding models have come a long way in representing words mathematically, they still fail to capture the relations between words across a sentence. In particular, they cannot differentiate sentences built from similar words but expressing very different sentiments, such as excitement or sorrow.

4. LATENT SEMANTIC ANALYSIS

Latent Semantic Analysis is an algorithm that applies Singular Value Decomposition to unstructured data to discover relationships between concepts and terms.

Latent Semantic Analysis is primarily an information retrieval technique, used for automatic document categorisation, searching for key concepts and Search Engine Optimisation (Latent Semantic Analysis (LSA), n.d.).
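A minimal sketch of the idea, assuming scikit-learn and a toy corpus of my own: tf-idf vectors are compressed with truncated Singular Value Decomposition so that each document is described by a few latent concepts.

```python
# Minimal LSA sketch: tf-idf followed by truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "stock markets fell on inflation fears",
    "the central bank raised interest rates again",
    "the home team won the championship final",
    "an injury forced the star player out of the match",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# SVD compresses the term-document matrix into two latent concepts.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_concepts = lsa.fit_transform(X)
print(doc_concepts.round(2))  # each row: one document's weight on the two concepts
```

The finance documents should load mainly on one concept and the sports documents on the other, which is the behaviour that search and categorisation systems exploit.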

Latent Semantic Analysis is, however, incapable of handling polysemy, i.e., of capturing the multiple meanings of a word.

5. TOPIC MODELLING

Topic Modelling is the practice of automatically identifying topics in a textual object and uncovering hidden patterns in a text corpus (Bansal, 2016).

These methods are particularly useful for organising large textual datasets such as social media profiles, e-mails or customer reviews, for feature selection, for retrieving information from unstructured data and for clustering documents. They are also helpful in recommendation engines and applicant tracking systems (Bansal, 2016).
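The sketch below runs Latent Dirichlet Allocation, one common topic-modelling algorithm, over a handful of made-up customer reviews; the documents, topic count and library choice are illustrative assumptions only.

```python
# Minimal topic-modelling sketch with Latent Dirichlet Allocation (scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "battery life of this phone is excellent",
    "camera quality and battery are great",
    "the delivery was late and the packaging was damaged",
    "slow shipping and poor packaging ruined the experience",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show the top words that characterise each discovered topic.
terms = vec.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-4:]]
    print(f"Topic {idx}: {top_words}")
```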

Topic Modelling also struggles with polysemy, i.e., it cannot capture the multiple meanings of a word.

6. NEURAL NETWORK FRAMEWORKS

Traditional models fail to capture the ordering of words and the impact it may have on the outcome.

Ex: “Cat Walk” and “Walk Cat”

These two phrases may be represented similarly in many conventional Word Embedding frameworks but carry two very different meanings. The following frameworks have been very helpful in such cases:

I. Recurrent Neural Networks are a solution to such problems. These networks allow a model to take into consideration the previous words in a sentence and the impact they have on the meaning of the words that follow. This is particularly helpful in many text analysis tasks such as Parts-of-Speech tagging (a word may be a verb or a noun depending upon context), Named Entity Recognition and Sentiment Analysis (Srivastav et al., 2020); a minimal sentiment-classification sketch follows this list.

II. Recursive Neural Networks, or Tree Neural Networks, generalise Recurrent Neural Networks by operating over a recursive tree structure rather than a linear sequence, which helps in solving various Natural Language Processing problems.

These structures are helpful in representing relations between elements that lie far apart in a sentence. Hence, they find wide application in customer review segmentation and Sentiment Analysis (Srivastav et al., 2020).

III. Seq2seq Models - Sequence-to-sequence models generate a sequence of outputs from a sequence of inputs. They are also referred to as dual Recurrent Neural Network models or Encoder-Decoder models.

These models have found a variety of applications. They are however mostly used in translation tasks (translating one language to another), chatbots and question-answering (Srivastav et al., 2020).
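To make the idea concrete, the sketch below wires up a very small recurrent network for sentiment classification in Keras; the architecture, layer sizes and dummy data are assumptions chosen only to show the shape of such a model, not a reference implementation from the cited work.

```python
# Minimal recurrent-network sketch for binary sentiment classification (Keras).
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, max_len = 1000, 20

model = tf.keras.Sequential([
    layers.Embedding(vocab_size, 32),        # map word ids to dense vectors
    layers.SimpleRNN(32),                    # read tokens in order, carrying context forward
    layers.Dense(1, activation="sigmoid"),   # positive / negative score
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy integer-encoded sequences and labels, just to show the expected shapes.
X = np.random.randint(1, vocab_size, size=(8, max_len))
y = np.random.randint(0, 2, size=(8,))
model.fit(X, y, epochs=1, verbose=0)
print(model.predict(X[:2], verbose=0))
```

Because the recurrent layer consumes tokens in order, "Cat Walk" and "Walk Cat" produce different hidden states, which is exactly what the bag-of-words style models above cannot do.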

7. CONCLUSION

The models discussed here have their own use cases and are helpful in different situations in their own unique ways. Each was designed with a particular set of problem statements, and the situations where it would be helpful, in mind. Further research is being carried out and new methodologies are being developed to overcome the limitations of existing frameworks.

REFERENCES

Bansal, S. (2016). Beginners Guide to Topic Modeling in Python. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

Brownlee, J. (2019). What are Word Embeddings for Text. Machine Learning Mastery. https://machinelearningmastery.com/what-are-word-embeddings/

Gupta, S. (2019). Word Embeddings in NLP and its Applications. KDNuggets. https://www.kdnuggets.com/2019/02/word-embeddings-nlp-applications.html

Latent Semantic Analysis (LSA). (n.d.). Market Muse. Retrieved May 29, 2021, from https://blog.marketmuse.com/latent-semantic-analysis-definition/

Srivastav, A., Khan, H., & Mishra, A. K. (2020). Advances in Computational Linguistics and Text Processing Frameworks. 217–244. https://doi.org/10.4018/978-1-7998-2772-6.ch012

Tello, J. (n.d.). Word Frequency Analysis: A Method to Improve Your Writing. Freelance Writing. Retrieved May 29, 2021, from https://www.freelancewriting.com/copywriting/word-frequency-analysis-improve-writing/
