I’m going to give you two definitions of latent semantic indexing. The reason is that LSI is derived from a mathematical technique for information retrieval, originally used in universities to make searching large databases of information more accurate. The first definition explains LSI and LSA (latent semantic analysis) from an academic perspective. The second describes how search engines (primarily Google) use LSI in their search algorithms to produce their results.
Latent semantic analysis (LSA) is a theory and method for extracting and representing the contextual meaning of words through statistical computations applied to a large corpus of text. The underlying idea is that the aggregate of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to one another. The adequacy of LSA’s reflection of human knowledge has been established in a number of ways. For example, its scores overlap with those of humans on standard vocabulary and subject-matter tests; it mimics human word sorting and category judgments; it simulates word–word and passage–word lexical priming data; and, as reported, it accurately estimates the coherence of passages, how easily passages can be learned by individual students, and the quality and quantity of knowledge contained in an essay.
LSA can be interpreted in two ways:
(1) simply as a practical expedient for obtaining rough estimates of the substitutability of words in larger text segments, and of the kinds of meaning similarities, not yet fully specified, between words and text segments that such relationships may reflect, or
(2) as a model of computational processes and representations underlying substantial parts of the acquisition and use of knowledge. We sketch both views below.
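Concretely, the statistical computation at the heart of LSA is a truncated singular value decomposition (SVD) of a term-document matrix. The notation below is the standard formulation; the choice of rank $k$ (typically a few hundred for large corpora) is an empirical matter:

```latex
% A is the t x d term-document matrix, where entry a_{ij} is a
% (usually weighted) count of term i in document j. LSA keeps only
% the k largest singular values:
A \approx A_k = U_k \Sigma_k V_k^{\top}
% U_k (t x k): term coordinates in the latent space
% \Sigma_k (k x k): the k largest singular values
% V_k (d x k): document coordinates in the latent space
```

Because $k$ is much smaller than the vocabulary, terms and documents that share patterns of co-occurrence are forced onto nearby coordinates, which is what produces the "semantic" similarity the rest of this article discusses.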
Common keyword searches approach a collection of documents with a kind of bookkeeping mindset: a document either contains a given word or it does not, with no middle ground. We create a set of results by examining each document for certain keywords and phrases, discarding the documents that do not contain them, and sorting the rest according to some ranking scheme. Each document is judged on its own by the search algorithm: there is no interdependence of any kind between documents, which are evaluated solely on the basis of their content.
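A minimal sketch of that binary, document-by-document behavior, with an illustrative corpus and a crude term-frequency score (both assumptions of mine, not part of any real engine):

```python
def keyword_search(query, documents):
    """Boolean keyword search: keep only documents containing ALL query
    terms, ranked by how often those terms appear."""
    terms = query.lower().split()
    results = []
    for doc in documents:
        words = doc.lower().split()
        if all(t in words for t in terms):          # contains or doesn't: no middle ground
            score = sum(words.count(t) for t in terms)  # crude relevance score
            results.append((score, doc))
    # Each document is judged in isolation; no document affects another's score.
    return [doc for score, doc in sorted(results, reverse=True)]

docs = [
    "topology of an n-dimensional manifold",
    "manifold learning in high dimensions",
    "introduction to point-set topology",
]
print(keyword_search("manifold", docs))
```

Note that the third document, although clearly related in subject matter, is discarded outright because the literal string never appears in it.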
Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the collection of documents as a whole, to see which other documents contain some of those same words. LSI assumes that documents with many words in common are semantically close, and that documents with few words in common are semantically distant. This simple method correlates surprisingly well with how a human, looking at the content, might classify a collection of documents. Although the LSI algorithm doesn’t understand anything about the meaning of words, the patterns it sees can make it seem surprisingly smart.
When you search an LSI-indexed database, the search engine looks at the similarity values it has calculated for each content word and returns the documents it judges to best fit the query. Because two documents can be semantically very close even if they don’t share a particular keyword, LSI does not require an exact match to return useful results. Where a simple keyword search fails because there is no exact match, LSI will often return relevant documents that do not contain the keyword at all.
As an example, let’s say we use LSI to index a collection of math articles. If the words n-dimensional, manifold, and topology appear together in enough articles, the search algorithm will notice that the three terms are semantically close. A search for “n-dimensional manifold” will therefore return a set of articles that contain that phrase (the same result we would get with an ordinary search), but also articles that contain only the word topology. The search engine understands no math at all, but looking at a sufficient number of documents teaches it that the three terms are related. It then uses that information to provide an expanded set of results with better recall than a simple keyword search.
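The query side of that example can be sketched too. A query is “folded in” to the same latent space as the documents (the standard LSI fold-in formula), so a document that never mentions the query word can still score highly. The corpus below is made up for illustration, as is the rank k=2:

```python
import numpy as np

docs = [
    "manifold topology n-dimensional",
    "manifold n-dimensional geometry",
    "topology geometry",           # never mentions "manifold"
    "pasta recipe cooking",        # unrelated
]
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = Vt[:k].T                # document coordinates in the latent space

def fold_in(query):
    """Project a raw query term-count vector into the k-dim latent space."""
    q = np.array([query.split().count(t) for t in vocab], dtype=float)
    return (q @ U[:, :k]) / s[:k]

def search(query):
    qv = fold_in(query)
    sims = doc_vecs @ qv / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(qv))
    return sorted(zip(sims, docs), reverse=True)

for sim, doc in search("manifold"):
    print(f"{sim:+.2f}  {doc}")
```

Because “topology” co-occurs with “manifold” in the indexed articles, the topology-only document scores close to the exact matches, while the unrelated document scores near zero — an exact match is no longer required.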