Lucene series - Chapter 1 - Lucene package structure and workflow

Lucene package structure and workflow

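Lucene's analysis module is mainly responsible for lexical analysis and language processing, which produce Terms.
Lucene's index module is mainly responsible for creating the index; it contains IndexWriter.
Lucene's store module is mainly responsible for reading and writing the index.
Lucene's QueryParser is mainly responsible for parsing query syntax.
Lucene's search module is mainly responsible for searching the index.
Lucene's similarity module is mainly responsible for implementing relevance scoring.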

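Package	Function
org.apache.lucene.analysis	Language analyzers, mainly used for tokenization; Chinese support is mainly added by extending the classes in this package
org.apache.lucene.document	Document structure management for index storage, analogous to a table schema in a relational database
org.apache.lucene.index	Index management, including building the index and deleting from it
org.apache.lucene.queryParser	Query parser; implements operations between query keywords, such as AND, OR, and NOT
org.apache.lucene.search	Search management; retrieves results according to the query conditions
org.apache.lucene.store	Data storage management, mainly low-level I/O operations
org.apache.lucene.util	Common utility classes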
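
To make the division of labor among these packages more concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the class `ToyPipeline` and its methods `analyze`, `index`, and `search` are hypothetical names invented for this illustration, standing in roughly for the analysis, index, and search roles described above. The lowercase-and-split-on-whitespace analysis and the AND-only query are simplifying assumptions, not how Lucene itself behaves.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy model only: hypothetical names, not Lucene classes.
class ToyPipeline {
    // Inverted index: term -> sorted ids of the documents that contain it.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // "analysis" role: turn raw text into terms (lowercase, split on whitespace).
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // "index" role: record every term of the document in the inverted index.
    void index(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // "search" role: analyze the query the same way, then intersect the posting lists.
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : analyze(query)) {
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        ToyPipeline p = new ToyPipeline();
        p.index(1, "Lucene builds an inverted index");
        p.index(2, "Elasticsearch builds on Lucene");
        System.out.println(p.search("lucene builds")); // [1, 2]
        System.out.println(p.search("inverted"));      // [1]
    }
}
```

The point of the sketch is only that the same analysis step runs at index time and at query time, which is why the query "lucene builds" matches documents that were indexed with a capital "L".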

2019-04-13

Search

A study of how Lucene tokenization works

#Background

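I recently needed to optimize our company's tokenizer at work, which means modifying the IK source code. To understand not just what it does but why, I first want to study the principles and implementation of tokenization in Lucene thoroughly, and then move on to the source code of the ES IK analyzer. The source study below is based on Lucene 6.6.0. The documentation in the Lucene source is well organized, so reading the documentation alongside the code is an efficient way to learn. Everything in this post comes from Lucene's documentation and code examples.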

#Class diagram

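The class diagram shows the core classes involved in tokenization in Lucene. An Analyzer builds TokenStreams, which analyze text; it therefore represents a policy for extracting terms from text. To define what analysis is performed, subclasses must implement the createComponents method; the resulting components are then reused on each call to tokenStream when text is tokenized.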

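A TokenStream enumerates a sequence of tokens, which come either from the fields of a document or from query text. TokenStream is an abstract class with two kinds of concrete implementation: Tokenizer and TokenFilter. The input of a Tokenizer is a Reader, while the input of a TokenFilter is another TokenStream.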
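
The Tokenizer/TokenFilter split is a decorator chain: one source reads characters, and each filter wraps the stream before it. The following is a minimal sketch of that shape in plain Java. All types here (`ToyChainDemo`, `Stream`, `WhitespaceTokenizer`, `LowerCaseFilter`, `drain`) are hypothetical stand-ins written for this illustration; they are not the Lucene classes and leave out attributes, reset/end/close, and reuse.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model only: mimics the shape of TokenStream / Tokenizer / TokenFilter.
class ToyChainDemo {
    // Common abstraction: something that hands out tokens one by one.
    abstract static class Stream {
        // Returns the next token, or null when the stream is exhausted.
        abstract String next();
    }

    // Source of the chain: its input is a Reader.
    static class WhitespaceTokenizer extends Stream {
        private final Reader input;

        WhitespaceTokenizer(Reader input) {
            this.input = input;
        }

        @Override
        String next() {
            try {
                StringBuilder sb = new StringBuilder();
                int c;
                while ((c = input.read()) != -1) {
                    if (Character.isWhitespace(c)) {
                        if (sb.length() > 0) {
                            break; // whitespace ends the current token
                        }
                    } else {
                        sb.append((char) c);
                    }
                }
                return sb.length() > 0 ? sb.toString() : null;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Filter: its input is another stream, which it wraps and transforms.
    static class LowerCaseFilter extends Stream {
        private final Stream input;

        LowerCaseFilter(Stream input) {
            this.input = input;
        }

        @Override
        String next() {
            String token = input.next();
            return token == null ? null : token.toLowerCase(Locale.ROOT);
        }
    }

    // Consumes a stream until it is exhausted and collects the tokens.
    static List<String> drain(Stream stream) {
        List<String> tokens = new ArrayList<>();
        for (String t = stream.next(); t != null; t = stream.next()) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Stream chain = new LowerCaseFilter(
                new WhitespaceTokenizer(new StringReader("Hello Lucene World")));
        System.out.println(drain(chain)); // [hello, lucene, world]
    }
}
```

Because a filter only depends on the abstract stream type, filters can be stacked in any number and order on top of a single tokenizer.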

TokenStream API workflow

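1. The TokenStream is instantiated, and the required attributes are added to it.
2. The consumer calls TokenStream.reset().
3. The consumer obtains local references to the attributes it wants to access.
4. The consumer calls incrementToken() until it returns false; after each call, the attributes hold the values for the current token.
5. The consumer calls end() so that the TokenStream can perform end-of-stream operations.
6. When finished with the TokenStream, the consumer calls close() to release resources.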

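import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// The class name is chosen for this example; the trailing numbers in the
// comments refer to the workflow steps listed above.
public class TokenStreamDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        String str = "中华人民";
        TokenStream stream = analyzer.tokenStream("content", new StringReader(str));
        // addAttribute registers the attribute if needed and returns a reference to it
        CharTermAttribute attr = stream.addAttribute(CharTermAttribute.class); //1,3
        PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class); //1,3
        OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class); //1,3
        TypeAttribute type = stream.addAttribute(TypeAttribute.class); //1,3
        stream.reset(); //2
        int position = 0;
        while (stream.incrementToken()) { //4
            // The absolute position is the running sum of the position increments
            int increment = posIncr.getPositionIncrement();
            if (increment > 0) {
                position = position + increment;
                System.out.print(position + ": ");
            }

            System.out.println(attr.toString() + "," + offset.startOffset() + "->" + offset.endOffset() + "," + type.type());
        }
        stream.end(); //5
        stream.close(); //6
        analyzer.close();
    }
}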


Output:
1: 中,0->1,<IDEOGRAPHIC>
2: 华,1->2,<IDEOGRAPHIC>
3: 人,2->3,<IDEOGRAPHIC>
4: 民,3->4,<IDEOGRAPHIC>
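
In the output above every position increment is 1, so the positions are simply 1 to 4. The running sum in the loop matters once increments differ: in Lucene, an increment greater than 1 typically marks a gap left by a removed token (for example a stop word), and an increment of 0 stacks a token on the same position as the previous one (for example a synonym). The sketch below isolates that arithmetic in plain Java; `PositionDemo` and `positions` are hypothetical names for this illustration, and the increment values are made up rather than produced by a real analyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: reproduces the running sum used in the example above.
class PositionDemo {
    // Converts per-token position increments into absolute positions.
    static List<Integer> positions(int[] increments) {
        List<Integer> result = new ArrayList<>();
        int position = 0;
        for (int increment : increments) {
            position += increment;
            result.add(position);
        }
        return result;
    }

    public static void main(String[] args) {
        // Third token follows a gap (increment 2); fourth shares its position (increment 0).
        System.out.println(positions(new int[] {1, 1, 2, 0})); // [1, 2, 4, 4]
    }
}
```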

2018-07-11

Search

ElasticSearch series notes (2) - ElasticSearch theoretical foundations

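This section introduces the theoretical foundations of ElasticSearch. The goal is to understand not only what it does but why: the theory and algorithms below are the soul of ElasticSearch!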