Topic Modeling using NMF and LDA
Topic modeling is a statistical technique for discovering hidden semantic patterns in an unstructured collection of documents. A large collection of documents is represented in terms of topics, and topics are represented in terms of words. This top-down approach helps expose hidden insights in the corpus: every document is a distribution over topics, and every topic is a distribution over words. The topics extracted by topic modeling are collections of related words. The intuition behind topic modeling rests on a mathematical framework based on the probability and statistics of words in each topic.
Of the existing algorithms for topic modeling, Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) are used extensively by data modelers and are widely accepted in the scientific community for topic extraction. LDA is a probabilistic model, whereas NMF is a matrix factorization and multivariate analysis technique.
The basic idea in topic modeling is to vectorize the given corpus by term frequency or term frequency-inverse document frequency (TF-IDF), split the resulting document-term matrix into a document-topic matrix and a topic-word matrix, and then optimize these two matrices using either probabilistic or factorization techniques.
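The split described above can be sketched in a few lines of standard-library Python. This is a minimal illustration only, not a production implementation: the four-document corpus is hypothetical, the vectorization uses raw term counts, and the factorization uses a hand-rolled NMF with multiplicative updates, chosen here for readability. The names `matmul`, `transpose`, `frobenius_error` and `nmf` are helpers introduced for this sketch. In practice one would normally rely on an established library rather than this code.

```python
import random
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "patient doctor hospital treatment",
    "doctor hospital nurse patient",
    "election vote senate policy",
    "senate policy vote campaign",
]

# Step 1: vectorize the corpus into a document-term matrix V of raw term counts.
vocab = sorted({word for doc in docs for word in doc.split()})
V = []
for doc in docs:
    counts = Counter(doc.split())
    V.append([counts[word] for word in vocab])

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Squared reconstruction error between V and the product W H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Approximate V by W H using multiplicative updates; W and H stay non-negative."""
    rng = random.Random(seed)
    n_docs, n_terms = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    H = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # Update the topic-word matrix: H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # Update the document-topic matrix: W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H

# Step 2: split V into a document-topic matrix W and a topic-word matrix H.
W, H = nmf(V, n_topics=2)

# Show the three highest-weighted words of each topic.
for k, row in enumerate(H):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print(f"topic {k}:", [word for _, word in top])
print("reconstruction error:", round(frobenius_error(V, W, H), 3))
```

Each row of `W` describes one document in terms of topics, and each row of `H` describes one topic in terms of words. A probabilistic method such as LDA produces matrices of the same two shapes but estimates them as probability distributions rather than by minimizing reconstruction error.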
The main challenge and source of ambiguity in topic modeling is validation. Extracting topics from a large collection of documents is itself unsupervised, i.e., the documents are not labelled prior to modeling. Validating the topics obtained from an unsupervised approach is therefore a tedious task, and one has to come up with a validation technique suited to the application. With dimensionality reduction techniques and advanced computational packages, one can visualize the similarity between topics extracted from the corpus.
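One simple way to quantify topic similarity before visualizing it is to compare the topic-word vectors directly. The sketch below uses cosine similarity, which is an assumption made here for illustration rather than a technique prescribed by this article. The topic names and weights are hypothetical.

```python
import math

# Hypothetical topic-word weights over a shared vocabulary (one vector per topic).
vocab = ["doctor", "hospital", "patient", "senate", "vote"]
topics = {
    "health_a": [0.40, 0.30, 0.30, 0.00, 0.00],
    "health_b": [0.30, 0.35, 0.30, 0.05, 0.00],
    "politics": [0.00, 0.05, 0.00, 0.50, 0.45],
}

def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Pairwise similarity between all topics.
names = list(topics)
similarity = {(a, b): cosine(topics[a], topics[b]) for a in names for b in names}

for a in names:
    print(a, [round(similarity[(a, b)], 2) for b in names])
```

Topics that share most of their high-weight words score close to 1 and are candidates for merging, while unrelated topics score close to 0. A matrix like this can also serve as the input to a dimensionality-reduction plot.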
Topic modeling has numerous applications. Keyword search over a corpus can be enhanced considerably by combining topic modeling with search engines, since topic models can pinpoint relevant words and documents by applying a threshold to the topic probabilities. Topic modeling is widely used in research labs in the domains of healthcare, journalism, politics and law enforcement. Modeling topics helps users do targeted research, which can lead to more efficient results.
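The threshold-based retrieval mentioned above can be sketched as follows. The document-topic probabilities, the function name `documents_for_topic` and the default threshold of 0.5 are all hypothetical choices made for illustration. A real system would take the probabilities from a fitted topic model.

```python
# Hypothetical document-topic distributions (each document's probabilities sum to 1).
doc_topics = {
    "doc_1": {"health": 0.85, "politics": 0.15},
    "doc_2": {"health": 0.40, "politics": 0.60},
    "doc_3": {"health": 0.05, "politics": 0.95},
}

def documents_for_topic(doc_topics, topic, threshold=0.5):
    """Return (document, probability) pairs at or above the threshold, best match first."""
    hits = [(doc, dist[topic]) for doc, dist in doc_topics.items()
            if dist.get(topic, 0.0) >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

print(documents_for_topic(doc_topics, "politics"))
# prints [('doc_3', 0.95), ('doc_2', 0.6)]
```

Raising the threshold returns fewer, more strongly related documents. Lowering it widens the search at the cost of precision.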