Notes on Introduction to Information Retrieval, Chapter 4 (in English)
Abstract
Chapter 4 focuses on how to construct an inverted index, a process called index construction or indexing. Given the constraints of hardware, we need to introduce new ideas to cope with the characteristics of computer hardware that are relevant for indexing. First, Section 4.2 introduces an efficient single-machine algorithm, blocked sort-based indexing. Single-pass in-memory indexing then has even better scaling properties because it does not hold the vocabulary in memory. To accommodate frequent changes, we need dynamic indexing, covered in Section 4.5. Finally, we cover some complicating issues that can arise in indexing, including security and indexes for ranked retrieval.
Hardware Basics
Knowledge Foundation
- Access to data in memory is much faster than access to data on disk.
- When doing a disk read or write, it takes a while for the disk head to move to the part of the disk where the data are located.
- Operating systems generally read and write entire blocks.
- Data transfers from disk to memory are handled by the system bus, not by the processor.
- Servers used in IR systems typically have several gigabytes (GB) of main memory, sometimes tens of GB.
Summary: Factors that Influence Indexing
Index construction refers to the entire process of turning a document collection into an inverted index.
- Consider the size of memory and the CPU clock cycle;
i.e., with a large enough memory, putting all the docs into memory makes building the inverted index fast.
- Keep as much data in memory as possible.
- Consider the time wasted on disk seeks.
Blocked Sort-Based Indexing
It is used when memory cannot hold the entire document collection.
The specific process is as follows:
- Divide the entire document collection into parts of equal size (each part is exactly one block) and read one block into memory.
- Parse each block into term_ID-Doc_ID pairs. To make index building more efficient, we replace the term itself with a term_ID; of course, we must maintain a mapping table between terms and term_IDs and keep it in memory.
- Sort the block's term_ID-Doc_ID pairs by term_ID, and build a temporary postings list from the Doc_IDs that share the same term_ID.
- Write the block's index to disk, and continue to process the next block.
- When the whole collection has been processed, the inverted indexes of all blocks are merged, and finally we get a complete inverted index.
- BSBI Algorithm
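A minimal in-memory sketch of the two key BSBI steps (inverting one block, then merging the per-block runs); the function names and the use of Python lists in place of on-disk runs are my own simplifications:

```python
import heapq
from collections import defaultdict

def invert_block(pairs):
    """Sort one block of (term_ID, doc_ID) pairs and group doc_IDs per term."""
    postings = defaultdict(list)
    for term_id, doc_id in sorted(pairs):
        plist = postings[term_id]
        if not plist or plist[-1] != doc_id:   # skip duplicate pairs
            plist.append(doc_id)
    return sorted(postings.items())            # stands in for writing a run to disk

def merge_blocks(blocks):
    """Merge the sorted per-block runs into one complete inverted index."""
    merged = defaultdict(list)
    for term_id, plist in heapq.merge(*blocks):  # streams the sorted runs
        merged[term_id].extend(plist)
    return sorted(merged.items())
```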
All in all, the key steps are steps 2 and 3, so the time complexity is determined by the sorting step.
Time Complexity
Θ(T log T)
T: an upper bound on the number of term_ID-Doc_ID pairs we must sort
Nevertheless, the actual indexing time is usually dominated by the time it takes to parse the documents and to do the final merge.
- Disadvantages
- Cost of maintaining the term-to-term_ID mapping table
- The larger the document collection, the larger the mapping table
- A large mapping data structure may not fit into memory
Single-Pass In-Memory Indexing
A more scalable alternative is single-pass in-memory indexing, which uses terms instead of term_IDs.
Thus, SPIMI can index collections of any size as long as there is enough disk space available.
The specific process is as follows:
- Split the entire document collection into term-document_ID pairs.
- Process this series of term-document_ID pairs as a data stream.
- In memory, a dictionary (which can be realized as a hash table) maps each term to its postings list; a term is added to the dictionary the first time it is seen, and the term is then used to find the location where the document_ID should be placed.
- When memory is full, sort the dictionary by term, and then write the sorted dictionary-postings block to disk.
- Merge the multiple sorted dictionary-postings blocks.
- SPIMI Algorithm
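A minimal sketch of SPIMI-Invert, assuming a fixed pair-count budget (`max_pairs`, a hypothetical stand-in for the real memory limit):

```python
from collections import defaultdict

def spimi_invert(token_stream, max_pairs=1_000_000):
    """Consume (term, doc_ID) pairs as a stream; whenever the in-memory
    dictionary 'fills up', sort it by term and emit it as one block."""
    dictionary = defaultdict(list)         # term -> postings, grown on the fly
    blocks = []
    for n, (term, doc_id) in enumerate(token_stream, 1):
        plist = dictionary[term]           # term added on first occurrence
        if not plist or plist[-1] != doc_id:
            plist.append(doc_id)
        if n % max_pairs == 0:             # "memory full": flush a block
            blocks.append(sorted(dictionary.items()))
            dictionary = defaultdict(list)
    if dictionary:
        blocks.append(sorted(dictionary.items()))
    return blocks                          # blocks are then merged as in BSBI
```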
According to steps 2 and 3, the time complexity is O(T), where T is the same as in BSBI: no sorting of the pairs themselves is required while a block is being built.
MapReduce: Distributed Indexing
Collections are often so large that we cannot perform index construction efficiently on a single machine.
In the face of this situation, we use the idea of divide and conquer.
- Main Idea
Distribute the document collection across a computer cluster for processing, using a master-slave mode.
The master node controls the entire process; for example, when one machine breaks down while processing its part of the document collection, the master reassigns that part to another machine.
- Map
The document collection is divided into n splits, each of which is processed by an analyzer into sorted term-document_ID pairs; this is essentially the parsing step of BSBI or SPIMI.
- Reduce
The key space of each analyzer's output is divided into two partitions, a-i and j-z, and each partition is handed over to a single inverter. Therefore, only two inverters are needed here.
Note:
Analyzers and inverters need not be distinct computers; the same machine can serve as an analyzer and as an inverter.
- Map in detail
- The document collection resides on an independent machine. The collection is divided into splits and transmitted to the machines serving as analyzers (IO time needs to be considered here).
- The splits are turned into term-docID pairs using BSBI or SPIMI.
- Reduce in detail
- Transfer the partition files to the inverters via IO (IO time should be considered here).
- Sort the pairs on each inverter, with time complexity O(T log T).
- Write the inverted index to an independent machine. (Note: we need to consider the size of the dictionary and the size of the postings; the size of the dictionary is the number of terms; and IO time must be considered.)
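A toy, single-process imitation of the map, partition, and reduce phases (real deployments ship splits and partition files between machines; the function names and the two-way a-i/j-z split are illustrative):

```python
from collections import defaultdict

def map_phase(split):
    """Analyzer: emit (term, doc_ID) pairs for one split of the collection."""
    pairs = []
    for doc_id, text in split:
        for term in text.lower().split():
            pairs.append((term, doc_id))
    return pairs

def partition(pairs):
    """Route pairs by key range: terms a-i to inverter 0, j-z to inverter 1."""
    parts = [[], []]
    for term, doc_id in pairs:
        parts[0 if term[0] <= 'i' else 1].append((term, doc_id))
    return parts

def reduce_phase(part):
    """Inverter: sort one partition and collect the postings list per term."""
    postings = defaultdict(list)
    for term, doc_id in sorted(part):
        postings[term].append(doc_id)
    return dict(postings)
```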
The discussion above assumes that the document collection does not change; in reality, documents are updated, inserted, deleted, and so on.
Thus, we need to introduce dynamic indexing.
Dynamic Indexing
- Main Idea
An auxiliary index (representing added documents) is constructed in memory, and an invalidation bit vector (representing deleted documents) is maintained. When the auxiliary index grows large enough, it is merged with the main index.
When answering a query, we must merge the results of both the main index and the auxiliary index, and then filter the results using the invalidation bit vector.
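A minimal sketch of query-time merging under this scheme (dict-based indexes and the `invalid_docs` set are stand-ins for the real structures):

```python
def query(term, main_index, aux_index, invalid_docs):
    """Union the postings from both indexes, then drop deleted documents."""
    results = set(main_index.get(term, [])) | set(aux_index.get(term, []))
    return sorted(doc_id for doc_id in results if doc_id not in invalid_docs)
```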
- How to construct
Introduce log2(T/n) indexes I0, I1, ..., Ii, where index Ii has size (2^i)*n.
T: the total number of postings
n: the size of the auxiliary (in-memory) index
- In the beginning, when the auxiliary index is full for the first time, I0 is built and saved to disk, and the auxiliary index is cleared.
- When the auxiliary index is full for the second time, it merges with I0 to become I1 (I0 is removed).
- When the auxiliary index is full for the third time, I0 is constructed again.
- When the auxiliary index is full for the fourth time, I0, I1, and the auxiliary index are merged to form I2 (I0 and I1 are cleared).
We can see the rule: the set of existing indexes evolves like an increasing binary number:
idx | I3 | I2 | I1 | I0 |
---|---|---|---|---|
No.1 | 0 | 0 | 0 | 0 |
No.2 | 0 | 0 | 0 | 1 |
No.3 | 0 | 0 | 1 | 0 |
No.4 | 0 | 0 | 1 | 1 |
No.5 | 0 | 1 | 0 | 0 |
No.6 | 0 | 1 | 0 | 1 |
No.7 | 0 | 1 | 1 | 0 |
No.8 | 0 | 1 | 1 | 1 |
No.9 | 1 | 0 | 0 | 0 |
No.10 | 1 | 0 | 0 | 1 |
No.11 | 1 | 0 | 1 | 0 |
… | … | … | … | … |
- Dynamic Algorithm
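A minimal sketch of the flush step in this logarithmic merging scheme; the carry loop mirrors binary addition (the dict-based index representation and helper names are my own):

```python
from collections import defaultdict

def merge_two(a, b):
    """Merge two inverted indexes given as dicts term -> sorted doc_ID list."""
    out = defaultdict(list)
    for index in (a, b):
        for term, plist in index.items():
            out[term].extend(plist)
    return {term: sorted(plist) for term, plist in out.items()}

def flush_auxiliary(aux, indexes):
    """Push a full auxiliary index into the hierarchy {level i -> Ii}.
    A carry propagates upward while the current level is occupied,
    exactly like adding 1 to a binary counter."""
    carry, level = aux, 0
    while level in indexes:
        carry = merge_two(indexes.pop(level), carry)
        level += 1
    indexes[level] = carry
```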
Ranked Retrieval and Access Control
Generally, the postings of an inverted index are sorted by document ID, which is helpful for compression, and a new posting only needs to be appended at the end.
For ranked retrieval, however, postings are sorted by weight; when an insertion happens, the list must be scanned to determine the insertion position.
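A small illustration of that insertion for weight-ordered postings (the parallel-list layout and function name are my own simplifications):

```python
import bisect

def insert_by_weight(doc_ids, weights, doc_id, weight):
    """Postings kept sorted by descending weight; 'weights' stores negated
    weights so bisect (which assumes ascending order) finds the slot.
    Both list.insert calls shift elements, i.e., an O(n) scan."""
    i = bisect.bisect_right(weights, -weight)
    weights.insert(i, -weight)
    doc_ids.insert(i, doc_id)
```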
In general, access authorization means checking whether a user may access a document, via an ACL (access control list); the access control table can itself be built with an inverted index structure.
The dictionary consists of the users, while each postings list holds the document_IDs that the user can access. In other words, the index stores a user-docID matrix.
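A minimal sketch of an ACL stored as an inverted index (the users and docIDs are made up for illustration):

```python
import bisect

# Dictionary entries are users; postings are the doc_IDs they may access.
acl_index = {
    "alice": [1, 3, 7],
    "bob":   [2, 3],
}

def may_access(user, doc_id):
    """Binary-search the user's sorted postings list for doc_id."""
    plist = acl_index.get(user, [])
    i = bisect.bisect_left(plist, doc_id)
    return i < len(plist) and plist[i] == doc_id
```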
These are my own notes and summary, drawing on Manning's original book as well as the **** blogger:
- iteye_17686: (link)
Please forgive any mistakes in my English.
Let's keep learning together~~