人工智能学习之倒排索引

倒排索引

倒排索引（英语：Inverted index），也常被称为反向索引、置入档案或反向档案，是一种索引方法，被用来存储在全文搜索下某个单词在一个文档或者一组文档中的存储位置的映射。它是文档检索系统中最常用的数据结构。

反向索引数据结构是典型的搜索引擎检索算法重要的部分。
一个搜索引擎执行的目标就是优化查询的速度：找到某个单词在文档中出现的地方。以前，正向索引开发出来用来存储每个文档的单词的列表，接着掉头来开发了一种反向索引。正向索引的查询往往满足每个文档有序频繁的全文查询和每个单词在校验文档中的验证这样的查询。
实际上，时间、内存、处理器等等资源的限制，技术上正向索引是不能实现的。
为了替代正向索引的每个文档的单词列表，能列出每个查询的单词所有所在文档的列表的反向索引数据结构开发了出来。
随着反向索引的创建，如今的查询能通过立即的单词标示迅速获取结果（经过随机存储）。随机存储也通常被认为快于顺序存储。

（引自维基百科）

索引构建过程

词条序列
排序
词典&倒排记录表
- 某个词项在单篇文档中的多次出现会被合并。
- 拆分成词典和倒排记录表两部分。
- 每个词项出现的文档数目(doc. frequency, DF)会被加入。

任务

子任务1：建索引
子任务2：布尔查询
文本处理要求
1. 不要求做词条变化如friends -> friend等，直接用空格作为分割符。
2. 都转成小写。
要求
- 输入
  1. 给定的文档
  2. 一个布尔查询Q（And Or 操作）
  3. 支持连续输入查询
- 输出
  1. 倒排索引表（存成文件dict.index)
  2. 从文件中读取索引用于查询
  3. Q的查询结果（界面输出文档名列表）

python实现

create_inverse_index.py

创建索引表的文件，事实上在实现时并没有完全按照上面的方法构建索引表，利用python的 set 类可以把步骤合起来。

# -*- coding: utf-8 -*-

import json

# 文件列表
documents = ['d1.txt','d2.txt','d3.txt','d4.txt','d5.txt','d6.txt','d7.txt','d8.txt','d9.txt','d10.txt']

# 给每个文件编号：id
document_map = dict(zip(range(len(documents)), documents))

# 读入文件建立倒排索引
def create_index():
    #  单词到对应的文章id
    # {'word': [1, 2, 3]}
    index = {}
    for id, document in document_map.items():
        # 把文件内容读出存放到 article 中
        document = 'files/' + document
        f = open(document,encoding='utf-8')
        article = f.read()
        f.close()
        words = article.split()
        for word in words:
            word = word.lower()
            if word in index:
                index[word].add(id)
            else:
                index[word] = set([id])
    return index

# 把索引写入 dict.index
def write_index_file(index):
    f = open('dict.index.txt','w',encoding='utf-8')
    for word,docIDS in index.items():
        indexing = word + '\t' + str(len(docIDS)) + '\t'
        for docID in docIDS:
            indexing += (' ' + str(docID))
        f.write(indexing+'\n')
    f.close()

# 把文件 ID 与文件名对应表写入文件 doc_map.txt
# 为了方便读取直接以字典的形式写入文件
def write_document_map():
    f = open('doc_map.txt','w',encoding='utf-8')
    handler = json.dumps(document_map)
    f.write(handler)
    f.close()
    print('doc_map.txt 创建成功！')

def main():
    index = create_index()
    write_index_file(index)
    write_document_map()
    print('倒排索引表 dict.index 创建成功！')

if __name__ == '__main__':
    main()

search.py

布尔查询的文件。需要注意的是其中的 search_index 方法，这里因为约定了输入查询的格式所以对于keywords的数量可以明确知道，因此使用下表的方法获得数据。事实上平时使用的检索（如百度）不会限制查询的词数（当然也不能太长了）而且各个词之间都是and布尔操作。

# -*- coding: utf-8 -*-

import re
import json

# 考虑到查询操作需要多次执行，把 index 作为全局变量，否则 search_index() 需要传递
# index 参数, document_map 同
index = {}
document_map = {}

# 从文件中读取索引创建索引字典
def create_index(file_name):
    inverse_index = {}
    f = open(file_name,'r')
    line = f.readline().rstrip()
    while line:
        line = re.split(r'\s+',line)
        # print(line)
        inverse_index[line[0]] = set()
        for id in line[2:]:
            inverse_index[line[0]].add(id)
        line = f.readline().rstrip()
    return inverse_index

# 从文件中读取 ID 表创建 document_map
def create_map(file_name):
    doc_map = {}
    f = open(file_name,'r')
    handler = f.read()
    doc_map = json.loads(handler)
    f.close()
    return doc_map

# 在索引字典中搜索 keywords
# 因为输入查询的格式已经约定所以使用了下标的方法获得数据
def search_index(keywords):
    if len(keywords) > 1:
        ids = index.get(keywords[0],set())
        if keywords[1] == 'and':
            ids = ids & index.get(keywords[2],set())
        elif keywords[1] == 'or':
            ids = ids | index.get(keywords[2],set())
    else:
        ids = index.get(keywords[0],set())
    # result = ''
    # for id in ids:
    #     result += (id+' ')
    # print(result)
    print_result(ids)

# 输出结果函数
def print_result(ids):
    if len(ids) == 0:
         print("sorry,can not find the message!")
    else:
        result = ''
        for id in ids:
            result += (document_map[id]+' ')
        print(result)

def main():
    # 从文件中读取索引创建索引字典用于查询
    global index 
    index = create_index('dict.index.txt')
    # 读入 doc_map.txt 创建 document_map
    global document_map
    document_map = create_map('doc_map.txt')
    # 构造匹配规则
    # pattern = re.compile(r'^([0-9a-zA-Z\-\_]+)(\s+)?((and|or)(\s+)?([0-9a-zA-Z\_\-]+))?$')
    pattern = re.compile(r'^(\S+)(\s+)?((and|or)(\s+)?(\S+))?$')
    while True:
        query = input('请输入查询的内容：')
        query = query.lower()
        if query == 'exit':
            break

        if re.match(pattern,query):
            # 切词
            keywords = re.split(r'\s+',query)
            if keywords:
                search_index(keywords)
        else:
            print('输入格式不正确！')

if __name__ == '__main__':
    main()

测试样例

测试文档可以自己编写。

请输入查询的内容：rose
d1.txt 
请输入查询的内容：in
d3.txt d8.txt d6.txt d7.txt d4.txt d1.txt d9.txt d2.txt d10.txt 
请输入查询的内容：rose and in
d1.txt 
请输入查询的内容：rose or in
d3.txt d8.txt d6.txt d7.txt d4.txt d1.txt d9.txt d2.txt d10.txt 
请输入查询的内容：asf
sorry,can not find the message!
请输入查询的内容：rose and
输入格式不正确！