Lucene index file structure (1)

  • Index
    • In Lucene, an Index lives in a single directory.
    • All the files in that directory together constitute the Index.
  • Segment
    • An Index can contain many Segments, and each Segment is independent.
      Newly added documents can be built into a new Segment, and different Segments can be merged.
    • Files that share a prefix, such as “_0”, “_1”, or “_2”, belong to the same Segment.
    • segments.gen and segments_X are the Segment metadata files; they store the Segments’ property information.
  • Document
    • The Document is the basic unit for building an Index. Different Documents are stored in different Segments, and one Segment can contain many Documents.
    • A newly added Document is stored in a newly built Segment; as Segments are merged, Documents from different Segments end up in the same Segment.
  • Field
    • A Document may contain different kinds of information, such as time, content, and author. Each kind can be indexed separately and stored in its own Field.
    • Different Fields can be indexed in different ways; the details are explained when we analyze how Fields are stored.
  • Term
    • The Term is the basic unit of the Index. It is the string produced by lexical analysis and language processing.
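
The prefix rule for Segments can be illustrated with a small sketch. The file listing below is made up for illustration, and the pattern assumes segment files are named "_<name>" followed by "." or "_"; neither is taken from a real index or from Lucene's own code.

```python
import re
from collections import defaultdict

# Hypothetical directory listing, for illustration only.
FILES = ["segments.gen", "segments_5", "write.lock",
         "_0.fnm", "_0.fdt", "_0.fdx", "_1.fnm", "_1.tis", "_2.cfs"]

# Assumed naming scheme: "_" plus letters/digits, followed by "." or "_".
SEGMENT_PREFIX = re.compile(r"^(_[0-9a-z]+)(?=[._])")

def group_by_segment(names):
    """Map each segment prefix to its files; index-level files go under None."""
    groups = defaultdict(list)
    for name in names:
        match = SEGMENT_PREFIX.match(name)
        groups[match.group(1) if match else None].append(name)
    return dict(groups)

groups = group_by_segment(FILES)
for prefix, members in groups.items():
    print(prefix, members)
```

Files such as segments_5 and write.lock do not start with an underscore, so they fall into the index-level group rather than any Segment.
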
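Index files in the newer format (file names as of Lucene 4.x):

Name | Extension | Brief Description
--- | --- | ---
Segments File | segments_N | Records how many segments the index contains and how many documents each segment contains.
Segment Info | .si | Stores the metadata of a segment.
Lock File | write.lock | Prevents multiple IndexWriters from writing to the same index files at the same time.
Compound File | .cfs, .cfe | Stores all of the index information in a single compound index file.
Fields | .fnm | Stores the fields this segment contains, with each field's name and index type.
Field Index, Field Data | .fdx, .fdt | Stores the documents this segment contains, the fields in each document, and the information in each field.
Term Dictionary, Term Index | .tim, .tip | .tim stores the Term statistics for each field and holds pointers into the .doc, .pos, and .pay files. .tip stores the index into the Term dictionary, which allows random access.
Frequencies and Skip Data | .doc | Stores the Term frequency information for each document in this segment.
Positions | .pos | Stores the Term position information for each document in this segment.
Payloads and Offsets | .pay | Stores the payloads and Term offsets for each document in this segment. Part of the Term position information is stored in the .pos file.
Norms | .nvd, .nvm | .nvm stores the metadata of the per-field weighting factors (norms); .nvd stores the weighting data.
Per-Document Values | .dvd, .dvm | .dvm stores the metadata of the per-document weighting factors (doc values); .dvd stores the data.
Term Vectors | .tvx, .tvd, .tvf | .tvd stores the Terms, Term frequencies, positions, payloads, and similar information for the documents in this segment. .tvx is the index used to load a specific document into memory. .tvf stores the field-level term vector information.
Live Documents | .liv | Records which documents in the segment are live (not deleted).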

Index files in the older format (file names as of Lucene 3.x, used in the rest of this section):

Name | Extension | Brief Description
--- | --- | ---
Segments File | segments.gen, segments_N | Stores information about segments
Lock File | write.lock | The write lock prevents multiple IndexWriters from writing to the same file.
Compound File | .cfs | An optional “virtual” file consisting of all the other index files, for systems that frequently run out of file handles.
Fields | .fnm | Stores information about the fields
Field Index | .fdx | Contains pointers to field data
Field Data | .fdt | The stored fields for documents
Term Infos | .tis | Part of the term dictionary, stores term info
Term Info Index | .tii | The index into the Term Infos file
Frequencies | .frq | Contains the list of docs which contain each term, along with frequency
Positions | .prx | Stores position information about where a term occurs in the index
Norms | .nrm | Encodes length and boost factors for docs and fields
Term Vector Index | .tvx | Stores offset into the document data file
Term Vector Documents | .tvd | Contains information about each document that has term vectors
Term Vector Fields | .tvf | The field-level info about term vectors
Deleted Documents | .del | Info about which documents are deleted

Lucene’s index stores not only the forward (“positive”) mapping but also the inverted (“negative”) mapping.

Forward (positive) mapping

  • From Index to Term: Index -> Segment -> Document -> Field -> Term
  • Each level stores the metadata of the level below it, much as a province, a city, and a county each hold information about the units beneath them.
    • segments_N: how many Segments the Index has and how many Documents each Segment has.
    • .fnm: how many Fields the Segment contains, with each Field’s name and indexing method.
    • .fdx, .fdt: all the Documents in the Segment, how many Fields each Document has, and what information each Field records.
    • .tvx, .tvd, .tvf: how many Documents the Segment has, how many Fields each Document has, how many Terms each Field has, and each Term’s string, positions, and so on.

Inverted (negative) mapping

  • Term -> Document
    • .tis, .tii: the Term dictionary, that is, all the Terms in the Segment sorted in lexicographic order.
    • .frq: the postings lists, that is, for each Term, the list of IDs of the Documents that contain it.
    • .prx: for each Term in the postings lists, the positions at which it occurs in each Document that contains it.
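
To make the two directions concrete, here is a minimal in-memory sketch that derives a Term -> Document mapping from a Document -> Field -> Term mapping. The toy documents and the names `forward`, `invert`, and `doc_freqs` are assumptions for illustration; this shows the idea only and says nothing about the on-disk layout of the .tis, .frq, or .prx files.

```python
from collections import defaultdict

# Forward mapping: document ID -> field name -> terms (toy data, already analyzed).
forward = {
    0: {"content": ["lucene", "index", "file"]},
    1: {"content": ["lucene", "segment", "lucene"]},
}

def invert(forward_index, field):
    """Build term -> {doc_id: [positions]} for one field."""
    postings = defaultdict(dict)
    for doc_id in sorted(forward_index):
        terms = forward_index[doc_id].get(field, [])
        for position, term in enumerate(terms):
            postings[term].setdefault(doc_id, []).append(position)
    return dict(postings)

postings = invert(forward, "content")

# Sorted terms: the role played by the term dictionary.
term_dictionary = sorted(postings)

# Per term, (doc_id, frequency) pairs: the role played by the postings lists.
doc_freqs = {term: [(doc_id, len(positions))
                    for doc_id, positions in postings[term].items()]
             for term in term_dictionary}

print(term_dictionary)
print(doc_freqs["lucene"])
print(postings["lucene"][1])
```

The position lists kept per Term and Document correspond to what the position file records, while the (document, frequency) pairs correspond to the postings lists.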

Primitive types

  • Byte: the most basic type, 8 bits long.
  • UInt32: made up of 4 Bytes.
  • UInt64: made up of 8 Bytes.
  • VInt:
    • A variable-length integer that may take one or more Bytes.
    • Earlier Bytes hold the lower-order bits. The highest bit of each Byte is a flag (1 means another Byte follows), and the other 7 bits carry data.
    • For example: 51271 - [1]1000111, [1]0010000, [0]0000011
  • Chars: UTF-8 encoded bytes.
  • String: a VInt giving the number of Chars, followed by the Chars themselves.
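
The VInt and String layouts can be sketched in a few lines. The function names `write_vint`, `read_vint`, and `write_string` are chosen for this example and are not Lucene's API. The String prefix here counts UTF-8 bytes, which is an assumption; for ASCII text it equals the number of characters.

```python
def write_vint(value):
    """Encode a non-negative integer, low-order 7 bits first."""
    if value < 0:
        raise ValueError("VInt in this sketch must be non-negative")
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)                      # high bit clear: last byte
    return bytes(out)

def read_vint(data, offset=0):
    """Decode a VInt starting at offset; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7

def write_string(text):
    """Length prefix as a VInt (UTF-8 byte count, an assumption), then the bytes."""
    encoded = text.encode("utf-8")
    return write_vint(len(encoded)) + encoded

encoded = write_vint(51271)
print([format(b, "08b") for b in encoded])
print(read_vint(encoded))
print(write_string("lucene"))
```

For 51271 the three bytes match the bit patterns in the example above: 11000111, 10010000, 00000011.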