lucene index file constructure (2)

1. Prefix & Suffix

Lucene negative index will storage Term Dictionary. All Term sort by ‘dictionary sequence’, but Dictionary contain all word in documents, some words have long length and it’s occupy space will be astonish. In Lucene, when some word has same prefix with is privious word[may be it’s prefix is offset from it’s privious one], word storage it’s prefix’s offset in word then suffix.

Nature storage:

[VInt = 4] [t][e][r][m]
[VInt = 10][t][e][r][m][a][g][a][n][c][y]
[VInt = 9][t][e][r][m][a][g][a][n][t]
[VInt = 8][t][e][r][m][i][n][a][l]

Compress storage:

[VInt = 4] [t][e][r][m]
[VInt = 4 (offset)][VInt = 6][a][g][a][n][c][y]
[VInt = 8 (offset)][VInt = 1][t]
[VInt = 4(offset)][VInt = 4][i][n][a][l]

$4[term]\times3+4[agan]-3[offset]=13bytes$

Reduce occupy space, especially in dictionary sequence order condition, prefixs’ has big repeat probability

2.Delta

Lucene negative index need to storage a lot of Integer number like Id, position, frequence and so on. For reduce the occupy space of Integer, when Integer increase, it only storage it’s difference value.

16386 16387 16388 16389
[(1) 000, 0010][(1) 000, 0000][(0) 000, 0001]
[(1) 000, 0011][(1) 000, 0000][(0) 000, 0001]
[(1) 000, 0100][(1) 000, 0000][(0) 000, 0001]
[(1) 000, 0101][(1) 000, 0000][(0) 000, 0001]

16386 16387 16388 16389
[(1) 000, 0010][(1) 000, 0000][(0) 000, 0001]
[(0) 000, 0001]
[(0) 000, 0001]
[(0) 000, 0001]

$2[high num bytes]\times3 = 6$

3. Flag (A, B?)

There are same condition in Lucene index constructure, after block maybe have another block , maybe not. Nature storage is add a byte to flag it’s boolean. Actually a bit is enough, put origin data move left a bit, save a bit to show it’s boolean. Data minuse 2 got it’s origin data.

Obey

.frq DocDelta[, Freq?], DocSkip[, PayloadLength?]
.prx positionDelta[, Playload?](incomplete)

Not obey

.frq SkipChildLevePointer multi skip, when last level no need flag.
.tvf Positions, Offsets

Summary

When A, B condition occur a lot, save 8 times space is economical.
For not obey, cause have a global macro, it is effective in all Field even all index.

4.Skip list

For inprove search performance Lucene use a lot of skip list data structure.

Element is sequential, in lucene either number order or alphabet order.
Skip have interval, it’s a config in advance, that is skip element number.
Skip list have level, it’s config interve point it’s up level [12 -> 5 -> 2][interval 3]

Define difference

the interval define :
- not contain up level two element
- contain head, not contain end[Lucene define]
- contain both it’s up level two element
the level define :
- contain origin list, start from 1, like 1, 2, 3
- contain origin list, start from 0, like 0, 1, 2
- not contain origin list, start from 0[Lucene define]

Jump element & Skip list

http://lucene.apache.org/core/2_9_4/fileformats.html