An Overview of Latent Semantic Indexing
Latent semantic indexing (LSI) is a technique that projects queries and documents into a space with latent semantic dimensions. In the latent semantic space, a query and a document can be similar even if they share no terms, as long as their terms are semantically similar. LSI is therefore a similarity metric that serves as an alternative to word overlap measures. Because the latent space has fewer dimensions than the original space, LSI is also a method for dimensionality reduction. A dimensionality reduction takes a set of objects that exist in a high-dimensional space and represents them in a lower-dimensional space instead, often a two- or three-dimensional space for the purpose of visualization.

LSI is the application of a mathematical technique called singular value decomposition (SVD) to a term-by-document matrix. The projection into the LSI space is chosen so that the representations in the original space are changed as little as possible, where the change is measured by the sum of the squares of the differences. Many different mappings from a high-dimensional space to a low-dimensional space are possible. LSI chooses the mapping that is optimal in the sense that it minimizes this distance.

Choosing the number of dimensions is a problem in its own right. Reducing the number of dimensions can remove much of the noise, but keeping too few dimensions may lose important information. LSI performance improves considerably after ten to twenty dimensions, peaks at seventy to one hundred dimensions, and then slowly begins to diminish again. The same pattern of performance is observed with other datasets as well.
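The ideas above can be illustrated with a minimal sketch in Python, assuming NumPy is available. The five-term, five-document matrix, the helper names `cosine`, `lsi_fit`, and `lsi_project`, and the choice of k = 2 are all made up for illustration and are not part of any standard LSI API. The sketch projects a query with plain multiplication by the truncated left singular vectors; some formulations additionally rescale by the inverse singular values.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b / (na * nb))

def lsi_fit(A, k):
    """Truncated SVD of a term-by-document matrix A, keeping k dimensions.

    Returns the truncated factors plus the full list of singular values.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :], s

def lsi_project(U_k, vec):
    """Project a term-space vector (document or query) into the latent space."""
    return U_k.T @ vec

# Hypothetical toy data: rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 1, 0, 0],  # car
    [0, 1, 1, 0, 0],  # automobile
    [1, 1, 1, 0, 0],  # engine
    [0, 0, 0, 1, 1],  # flower
    [0, 0, 0, 1, 1],  # petal
], dtype=float)

k = 2  # assumed number of latent dimensions for this toy example
U_k, s_k, Vt_k, s_all = lsi_fit(A, k)

# A query containing only "car"; document 1 contains only "automobile" and "engine".
q = np.array([1, 0, 0, 0, 0], dtype=float)
doc_vehicle = A[:, 1]
doc_flower = A[:, 3]

# Word overlap: the query and document 1 share no terms.
raw_sim = cosine(q, doc_vehicle)

# Latent space: the same pair becomes similar; the flower document does not.
q_k = lsi_project(U_k, q)
latent_sim = cosine(q_k, lsi_project(U_k, doc_vehicle))
unrelated_sim = cosine(q_k, lsi_project(U_k, doc_flower))

# Rank-k reconstruction and its sum-of-squares difference from the original.
A_k = U_k @ np.diag(s_k) @ Vt_k
sq_error = float(np.sum((A - A_k) ** 2))

print("raw similarity:", raw_sim)
print("latent similarity (vehicle doc):", latent_sim)
print("latent similarity (flower doc):", unrelated_sim)
print("squared reconstruction error:", sq_error)
print("sum of discarded squared singular values:", float(np.sum(s_all[k:] ** 2)))
```

In this toy setup the query and the vehicle document have zero word overlap yet end up close in the latent space, because "car" and "automobile" both co-occur with "engine". The last two printed values agree, which reflects the sense in which the truncated SVD is the optimal mapping: the squared error of the rank-k approximation equals the sum of the squared singular values that were discarded.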