Lucene index file structure (1)

Index
- In Lucene, an Index lives in a single directory.
- All the files in that directory together constitute the Index.
Segment
- An Index can contain many Segments, and each Segment is independent.
  Newly added documents can be built into a new Segment, and different Segments can be merged.
- Files that share the same prefix, such as “_0”, “_1”, or “_2”, belong to the same Segment.
- segments.gen and segments_N are the Segment metadata files; they store the Segments' property information.
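
The prefix rule above can be illustrated with a small sketch that groups a directory listing by Segment. The helper name `group_by_segment`, the regular expression, and the sample file list are all assumptions made for illustration; they are not part of Lucene itself.

```python
import re
from collections import defaultdict

# Assumed pattern: a per-segment file starts with "_" plus an alphanumeric
# segment number, followed by "." or "_" (for example "_0.fnm").
SEGMENT_FILE = re.compile(r"^(_[0-9a-z]+)[._]")


def group_by_segment(filenames):
    """Split file names into per-segment groups and index-level files."""
    segments = defaultdict(list)
    index_level = []
    for name in sorted(filenames):
        match = SEGMENT_FILE.match(name)
        if match:
            segments[match.group(1)].append(name)
        else:
            # Files such as segments_N or write.lock belong to the whole index.
            index_level.append(name)
    return dict(segments), index_level


# Hypothetical directory listing, used only as sample input.
files = [
    "segments_2", "segments.gen", "write.lock",
    "_0.fnm", "_0.fdt", "_0.fdx",
    "_1.fnm", "_1.fdt", "_1.fdx",
]
segments, index_level = group_by_segment(files)
print(segments)
print(index_level)
```

Files that do not match the prefix pattern are treated as index-level files, which matches the role of segments.gen, segments_N, and write.lock described above.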
Document
- A Document is the basic unit for building an Index. Different Documents are stored in different Segments, and one Segment can contain many Documents.
- A newly added Document is stored in a newly built Segment; when Segments are merged, Documents from different Segments end up in the same Segment.
Field
- A Document may contain different kinds of information, such as time, content, author, and so on. Each kind can be indexed separately and stored in a different Field.
- Different Fields can be indexed in different ways; this is explained later, when the storage of Fields is analyzed.
Term
- A Term is the basic unit of the Index. It is the string that results from lexical analysis and language processing.
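
As a rough illustration of how raw text might become Terms, here is a toy sketch. The function `analyze`, the stop-word list, and the crude plural stripping are hypothetical stand-ins for lexical analysis and language processing; they do not reproduce any Lucene analyzer.

```python
import re

# Illustrative stop-word list (an assumption, not Lucene's list).
STOP_WORDS = {"a", "an", "the", "is", "of"}


def analyze(text):
    """Turn raw text into a list of terms."""
    # Lexical analysis: split the text into alphanumeric tokens.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    terms = []
    for token in tokens:
        # Language processing, step 1: lowercase.
        token = token.lower()
        if token in STOP_WORDS:
            continue
        # Step 2: very crude plural stripping, standing in for real stemming.
        if len(token) > 3 and token.endswith("s"):
            token = token[:-1]
        terms.append(token)
    return terms


terms = analyze("The Students read Lucene books")
print(terms)
```

Each string in the returned list plays the role of a Term: it is what would be looked up in the index, rather than the original surface word.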

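Name	Extension	Description
Segments file	segments_N	Records how many segments the index contains and how many documents each segment contains.
Segment metadata	.si	Stores the metadata of a segment.
Lock file	write.lock	Prevents multiple IndexWriters from writing to the same index files at the same time.
Compound file	.cfs, .cfe	Stores all the index information in a compound index file.
Segment field info	.fnm	Stores the fields this segment contains, together with each field's name and how it is indexed.
Segment document info	.fdx, .fdt	Stores the documents this segment contains, the fields in each document, and the information in each field.
Segment term info	.tim, .tip	The .tim file stores per-field term statistics and holds pointers into the .doc, .pos, and .pay files. The .tip file stores the index into the term dictionary, which allows random access.
Term frequencies and skip data	.doc	Stores the term frequency information for each document in this segment.
Term positions	.pos	Stores the term position information for each document in this segment.
Payloads and partial position info	.pay	Stores the payloads and term offsets for each document in this segment. Part of the term position information is stored in the .pos file.
Field weighting factors (norms)	.nvd, .nvm	The .nvm file stores the metadata for the field weighting factors. The .nvd file stores the field weighting data.
Per-document weighting factors	.dvd, .dvm	The .dvm file stores the metadata for the per-document weighting factors. The .dvd file stores the per-document weighting data.
Term vectors	.tvx, .tvd, .tvf	The .tvd file stores the terms, term frequencies, positions, payloads, and related information for the documents in this segment. The .tvx file is the index used to load a specific document into memory. The .tvf file stores the field-level term vector information.
Live documents	.liv	Stores information about which documents are live.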

Name	Extension	Brief Description
Segments File	segments.gen, segments_N	Stores information about segments
Lock File	write.lock	The Write lock prevents multiple IndexWriters from writing to the same file.
Compound File	.cfs	An optional “virtual” file consisting of all the other index files for systems that frequently run out of file handles.
Fields	.fnm	Stores information about the fields
Field Index	.fdx	Contains pointers to field data
Field Data	.fdt	The stored fields for documents
Term Infos	.tis	Part of the term dictionary, stores term info
Term Info Index	.tii	The index into the Term Infos file
Frequencies	.frq	Contains the list of docs which contain each term along with frequency
Positions	.prx	Stores position information about where a term occurs in the index
Norms	.nrm	Encodes length and boost factors for docs and fields
Term Vector Index	.tvx	Stores offset into the document data file
Term Vector Documents	.tvd	Contains information about each document that has term vectors
Term Vector Fields	.tvf	The field level info about term vectors
Deleted Documents	.del	Info about which documents are deleted

A Lucene index stores not only the forward mapping but also the inverted mapping.

Forward mapping

From Index to Term: Index -> Segment -> Document -> Field -> Term
Each level stores the metadata of the level below it, much as a province, a city, and a county each hold information about the units they contain.
- segments_N : how many Segments the Index has and how many Documents each Segment has.
- .fnm : how many Fields the Segment contains, plus each Field's name and how it is indexed.
- .fdx , .fdt : all the Documents in the Segment, how many Fields each Document has, and what information each Field records.
- .tvx , .tvd , .tvf : how many Documents the Segment has, how many Fields each Document has, how many terms each Field has, and each term's string, position, and so on.

Inverted mapping

From Term to Document: Term -> Document
- .tis , .tii : the term dictionary, that is, all the terms in the Segment sorted in lexicographic order.
- .frq : the postings list, that is, for each term, the IDs of the Documents that contain it, along with the frequency.
- .prx : the positions at which each term occurs within each Document in the postings list.
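
To make the two directions concrete, the following minimal in-memory sketch builds both from a few tiny documents. The names `build_index`, `forward`, `dictionary`, and `postings` are assumptions chosen for illustration, and the dictionaries only loosely mirror the roles of the files listed above; nothing here reflects Lucene's on-disk layout.

```python
from collections import defaultdict


def build_index(docs):
    """docs is a list of term lists; a document's ID is its list position."""
    forward = {}                  # doc ID -> terms (Document -> Term direction)
    postings = defaultdict(dict)  # term -> {doc ID: [positions]}
    for doc_id, terms in enumerate(docs):
        forward[doc_id] = list(terms)
        for position, term in enumerate(terms):
            postings[term].setdefault(doc_id, []).append(position)
    # Sorted terms play the role of the term dictionary.
    dictionary = sorted(postings)
    return forward, dictionary, dict(postings)


# Sample documents, already split into terms.
docs = [
    ["lucene", "stores", "an", "index"],
    ["an", "index", "has", "segments"],
    ["lucene", "merges", "segments"],
]
forward, dictionary, postings = build_index(docs)

# Per-document term frequencies, derived from the position lists.
freqs = {t: {d: len(p) for d, p in postings[t].items()} for t in dictionary}

print(dictionary)
print(postings["lucene"])
print(freqs["segments"])
```

Looking up `postings["lucene"]` answers "which Documents contain this Term, and where", which is the question the inverted mapping exists to answer; `forward[0]` answers the opposite question.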

Primitive types

Byte : the most basic type, 8 bits long.
UInt32 : composed of 4 Bytes.
UInt64 : composed of 8 Bytes.
VInt :
- A variable-length integer that may be composed of several Bytes.
- Earlier Bytes hold the lower-order bits. In each Byte, the high bit flags whether another Byte follows, and the remaining 7 bits carry data.
- For example: 51271 - [1]1000111, [1]0010000, [0]0000011
Chars : a sequence of UTF-8 encoded Bytes.
String : a VInt giving the number of Chars, followed by the Chars themselves.
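
The VInt and String layouts can be sketched in a few lines. The function names `encode_vint`, `decode_vint`, and `encode_string` are hypothetical, and the String helper assumes the length prefix counts UTF-8 bytes; treat this as an illustration of the scheme described above rather than Lucene's own code.

```python
def encode_vint(value):
    """Encode a non-negative integer, low-order 7 bits first."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while value >= 0x80:
        # Low 7 bits of data, with the high bit set: another byte follows.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # Last byte: high bit clear.
    return bytes(out)


def decode_vint(data, offset=0):
    """Return (value, next_offset) for the VInt starting at offset."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result, offset
        shift += 7


def encode_string(text):
    """Assumed layout: VInt length in UTF-8 bytes, then the bytes."""
    raw = text.encode("utf-8")
    return encode_vint(len(raw)) + raw


encoded = encode_vint(51271)
print([format(b, "08b") for b in encoded])
print(decode_vint(encoded))
print(encode_string("abc"))
```

Small values fit in a single Byte under this scheme, which is why variable-length encoding suits data such as string lengths that are usually small.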
