Information Extraction and Retrieval
In this course, we covered the foundational and advanced concepts of Information Retrieval (IR), focusing on how systems process and retrieve information efficiently. We started with the basics of Boolean retrieval, which serves as the groundwork for more complex systems. The limitations of Boolean models on large collections, where unranked result sets quickly become unmanageable, led us to vector space models, which score and rank documents by their similarity to the query.
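To make that contrast concrete, here is a minimal sketch of ranked retrieval in the vector space model, using scikit-learn's TfidfVectorizer; the tiny corpus and query are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; in practice this would be the document collection.
docs = [
    "boolean retrieval returns unranked matching documents",
    "the vector space model ranks documents by similarity",
    "index construction builds the inverted index",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)          # one TF-IDF vector per document
query_vec = vectorizer.transform(["rank documents by vector similarity"])

# Cosine similarity between the query and every document gives the ranking.
scores = cosine_similarity(query_vec, doc_matrix).ravel()
for idx in scores.argsort()[::-1]:
    print(f"doc {idx}: score {scores[idx]:.3f}")
```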
A significant part of the course was dedicated to index construction, where I learned about the algorithms essential for building efficient search systems. We discussed Blocked Sort-Based Indexing (BSBI) and Single-Pass In-Memory Indexing (SPIMI), each suited to different data scales and system requirements. The integration of these techniques into distributed systems using MapReduce highlighted the scalability challenges and solutions in large-scale environments.
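As an illustration of the SPIMI idea, here is a simplified sketch: postings accumulate in an in-memory dictionary, and a sorted block is emitted whenever the memory budget fills up. Writing blocks to disk and the final multi-way merge are elided, and the posting-count threshold stands in for a real byte budget.

```python
from collections import defaultdict

def spimi_invert(token_stream, max_postings=1_000_000):
    """Simplified SPIMI: accumulate postings in memory, flush a block when full.

    token_stream yields (term, doc_id) pairs in document order; each yielded
    block is a sorted list of (term, postings) that a final multi-way merge
    would combine into one inverted index.
    """
    dictionary = defaultdict(list)   # term -> postings list, built on the fly
    n_postings = 0
    for term, doc_id in token_stream:
        postings = dictionary[term]
        if not postings or postings[-1] != doc_id:   # skip repeats within a doc
            postings.append(doc_id)
            n_postings += 1
        if n_postings >= max_postings:               # block is "full": flush it
            yield sorted(dictionary.items())         # sort terms only at flush time
            dictionary = defaultdict(list)
            n_postings = 0
    if dictionary:
        yield sorted(dictionary.items())

# Tiny demonstration with an artificially small block size.
stream = [("babylon", 1), ("assyria", 1), ("babylon", 2), ("assyria", 3)]
for block in spimi_invert(stream, max_postings=3):
    print(block)
```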
Compression techniques were another critical area of focus. We examined methods for reducing the storage space required for dictionaries and postings in IR systems. Understanding the balance between compression efficiency and system performance was crucial, especially in contexts where rapid data retrieval is necessary.
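One standard combination here is gap encoding plus variable-byte compression of postings lists: storing differences between consecutive document IDs keeps the numbers small, and variable-byte codes then spend bytes only where they are needed. The sketch below uses illustrative doc IDs.

```python
def vbyte_encode(numbers):
    """Variable-byte encode a list of non-negative integers."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]            # least significant 7 bits first
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80              # mark the final byte of this number
        out.extend(reversed(chunk))   # emit most significant byte first
    return bytes(out)

def vbyte_decode(data):
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:               # high bit marks the final byte of a number
            numbers.append(n)
            n = 0
    return numbers

# Postings are stored as gaps between doc IDs, which keeps the numbers small.
postings = [824, 829, 215406]
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
encoded = vbyte_encode(gaps)          # 824, 5, 214577 -> far fewer bytes than fixed-width ints
assert vbyte_decode(encoded) == gaps
```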
Throughout the course, the emphasis was also on the practical applications of these theories in real-world search engines and databases. Learning about the operations of modern search technologies provided insights into their design and functionality.
The course also explored relevance feedback and query expansion, as well as probabilistic information retrieval, which are critical for enhancing search system effectiveness.
We learned how relevance feedback can significantly improve search results by allowing the system to adjust queries based on user-labeled relevant and non-relevant documents. The Rocchio algorithm was highlighted as a prominent method for implementing this feedback. For query expansion, we discussed methods for augmenting queries with synonyms or related terms to capture broader information needs, which primarily improves recall, sometimes at a cost to precision.
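A minimal numpy sketch of the Rocchio update follows; the alpha/beta/gamma weights are common textbook defaults, and the vectors are illustrative TF-IDF weights rather than real data.

```python
import numpy as np

def rocchio(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback: move the query vector toward the centroid
    of relevant documents and away from the centroid of non-relevant ones."""
    q = alpha * query_vec
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q -= gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q, 0.0)   # negative term weights are usually clipped to zero

# Illustrative 4-term vectors (e.g., TF-IDF weights).
query = np.array([1.0, 0.0, 0.5, 0.0])
rel = np.array([[0.9, 0.1, 0.4, 0.0], [0.8, 0.0, 0.6, 0.1]])
nonrel = np.array([[0.0, 0.9, 0.0, 0.8]])
print(rocchio(query, rel, nonrel))
```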
The probabilistic approach to information retrieval was introduced, emphasizing the Probability Ranking Principle (PRP) and models like the Binary Independence Model (BIM) and BM25. These models estimate the probability that a document is relevant to a query, an approach that underpins many of today's sophisticated search engines. We examined the assumptions these models make about term independence and relevance, and discussed their practical implications for designing IR systems.
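The following is a self-contained sketch of BM25 scoring, using the widely used smoothed-IDF variant; the toy tokenized collection, query, and parameter values are illustrative.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one document against a query with the BM25 formula.
    k1 and b are the usual free parameters; corpus is the full collection,
    used only for document frequencies and the average document length."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
        freq = tf[term]
        # Term-frequency saturation (k1) and document-length normalization (b).
        score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

# Toy tokenized collection; real systems precompute df and avgdl in the index.
corpus = [["ishtar", "descends", "to", "the", "underworld"],
          ["marduk", "slays", "tiamat"],
          ["the", "underworld", "gods"]]
print(bm25_score(["underworld", "gods"], corpus[2], corpus))
```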
Studying these models alongside their underlying assumptions, and seeing them at work in operational systems, gave me a comprehensive understanding of the complexity involved in designing and improving information retrieval systems.
For this assignment, I chose a chapter from Myths of Babylonia and Assyria on Project Gutenberg, since it provides a sizable, freely available text well suited to NLP experiments. I created a custom list of stopwords and adjusted hyperparameters such as the learning rate and window size to fine-tune a Word2Vec model for this specific text. The code computes similarity between words based on their embeddings, revealing connections and context within the text. The initial results are promising, but they suggest that further preprocessing could improve the model's performance.
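A sketch of the pipeline, assuming gensim's Word2Vec; the file name, stopword list, and hyperparameter values shown here are placeholders rather than the exact settings I used.

```python
import re
from gensim.models import Word2Vec

# Placeholder stopword list and file path; the actual list and
# hyperparameter values used for the assignment may differ.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "that", "was", "he", "his"}

with open("myths_of_babylonia_chapter.txt", encoding="utf-8") as f:
    raw = f.read().lower()

# Split into sentences, keep alphabetic tokens, drop stopwords.
sentences = [
    [tok for tok in re.findall(r"[a-z]+", sent) if tok not in STOPWORDS]
    for sent in re.split(r"[.!?]", raw)
]
sentences = [s for s in sentences if s]

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimensionality
    window=5,          # context window size (one of the tuned hyperparameters)
    alpha=0.025,       # initial learning rate (the other tuned hyperparameter)
    min_count=2,       # ignore very rare tokens
    workers=2,
)

# Query the embeddings for nearest neighbours of a word from the text
# (assumes the word occurs often enough to be in the vocabulary).
print(model.wv.most_similar("ishtar", topn=5))
```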