Lucene fnm index & output types

Start fnm index

IndexWriter:addDocument -> DocumentWriter:addDocument

A Document object in Lucene stores the properties of the original object being indexed.
The fieldInfos property of DocumentWriter stores information about the document's fields.

private FieldInfos fieldInfos;

fieldInfos = new FieldInfos();
fieldInfos.add(doc);
fieldInfos.write(directory, segment + ".fnm");

Build fnm index

public void write(IndexOutput output) throws IOException {
    output.writeVInt(size());
    for (int i = 0; i < size(); i++) {
        FieldInfo fi = fieldInfo(i);
        byte bits = 0x0;
        // option bit setting
        ...
        output.writeString(fi.name);
        output.writeByte(bits);
    }
}

The field count is written once, before the loop.
Each field's name is written inside the loop.
Each field's setting byte is written inside the loop, right after the name.

If a document has one field named "name" that is stored in the index and tokenized,
then xxx.fnm will look like this:

01 04 6e 61 6d 65 01

01 is the number of fields
04 is the length of the field name
6e 61 6d 65 is the ASCII encoding of "name"
01 is the field's setting byte

[size][01 length][01 name][01 setting][02 length][02 name][02 setting]...
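
The layout above can be reproduced without Lucene. The following is a minimal sketch, not Lucene's own code: FnmLayoutDemo, encode, decode and hex are hypothetical names, and it assumes ASCII field names and counts and lengths below 128, so that every VInt fits in one byte.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only (not part of Lucene).
class FnmLayoutDemo {
    // Writes [size] then [length][name][setting] per field.
    // Assumption: ASCII names, all counts < 128 (one-byte VInts).
    static byte[] encode(Map<String, Integer> fields) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(fields.size());                          // [size]
        for (Map.Entry<String, Integer> e : fields.entrySet()) {
            byte[] name = e.getKey().getBytes(StandardCharsets.US_ASCII);
            out.write(name.length);                        // [length]
            out.write(name, 0, name.length);               // [name]
            out.write(e.getValue().intValue());            // [setting]
        }
        return out.toByteArray();
    }

    // Reads the same layout back into name -> setting pairs.
    static Map<String, Integer> decode(byte[] data) {
        Map<String, Integer> fields = new LinkedHashMap<>();
        int pos = 0;
        int size = data[pos++] & 0xFF;
        for (int i = 0; i < size; i++) {
            int length = data[pos++] & 0xFF;
            String name = new String(data, pos, length, StandardCharsets.US_ASCII);
            pos += length;
            fields.put(name, data[pos++] & 0xFF);
        }
        return fields;
    }

    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> fields = new LinkedHashMap<>();
        fields.put("name", 0x01);
        byte[] data = encode(fields);
        System.out.println(hex(data));    // 01 04 6e 61 6d 65 01
        System.out.println(decode(data)); // {name=1}
    }
}
```

A LinkedHashMap is used so that fields come out in insertion order, which matches the numbered [01], [02] order of the layout.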

Lucene output types

The abstract class IndexOutput defines its own methods for writing each type.

public void writeVInt(int i) throws IOException {
    while ((i & ~0x7F) != 0) {
        writeByte((byte)((i & 0x7f) | 0x80));
        i >>>= 7;
    }
    writeByte((byte)i);
}
public void writeVLong(long i) throws IOException {
    ...
}

VInt and VLong are variable-length formats: each byte carries 7 bits of the value, so smaller values take fewer bytes.
An int takes between one and five bytes.
A non-negative long takes between one and nine bytes.

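i & ~0x7F is non-zero when i does not fit in 7 bits, so the loop keeps going while more than one byte is still needed.
(i & 0x7F) | 0x80 takes the low 7 bits and sets the byte's high bit to 1, which marks that another byte follows.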
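
To see the byte counts concretely, here is a small sketch that runs the same loop as writeVInt against an in-memory buffer. VIntDemo, encode and decode are hypothetical names used only for illustration, and decode is an assumed counterpart written for this example rather than Lucene's reader.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper class, for illustration only (not part of Lucene).
class VIntDemo {
    // Same loop as writeVInt, collecting the bytes in memory.
    static byte[] encode(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80); // low 7 bits, high bit set: more follows
            i >>>= 7;
        }
        out.write(i);                     // last byte, high bit clear
        return out.toByteArray();
    }

    // Assumed counterpart: rebuilds the int 7 bits at a time, lowest bits first.
    static int decode(byte[] data) {
        int value = 0;
        int shift = 0;
        for (byte b : data) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) break;   // high bit clear: this was the last byte
        }
        return value;
    }

    public static void main(String[] args) {
        int[] samples = {0, 127, 128, 300, 16384, Integer.MAX_VALUE, -1};
        for (int v : samples) {
            byte[] bytes = encode(v);
            System.out.println(v + " -> " + bytes.length + " byte(s), decoded " + decode(bytes));
        }
    }
}
```

For example, 300 is encoded as the two bytes AC 02: the low 7 bits (0x2C) with the high bit set, followed by the remaining value 2. Because the loop uses an unsigned shift, a negative int such as -1 always takes the full five bytes.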


public void writeString(String s) throws IOException {
    int length = s.length();
    writeVInt(length);
    writeChars(s, 0, length);
}

The string's length (its number of chars, from s.length()) is written first, as a VInt.
The string's chars are then written with writeChars.


public void writeChars(String s, int start, int length) throws IOException {
    final int end = start + length;
    for (int i = start; i < end; i++) {
        final int code = (int) s.charAt(i);
        if (code >= 0x01 && code <= 0x7F)
            writeByte((byte) code);
        else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
            writeByte((byte) (0xC0 | (code >> 6)));
            writeByte((byte) (0x80 | (code & 0x3F)));
        } else {
            writeByte((byte) (0xE0 | (code >>> 12)));
            writeByte((byte) (0x80 | ((code >> 6) & 0x3F)));
            writeByte((byte) (0x80 | (code & 0x3F)));
        }
    }
}

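Chars from 0x01 to 0x7F (ASCII) are written directly [1 byte]
Chars from 0x80 to 0x7FF are written as UTF-8, and char 0 is written as C0 80 [2 bytes]
All other chars are written as UTF-8 [3 bytes]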
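
The three branches can be checked on sample characters with a small sketch that copies the writeChars logic into an in-memory buffer. ModifiedUtf8Demo, encodeChars and hex are hypothetical names for illustration, not Lucene API.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper class, for illustration only (not part of Lucene).
class ModifiedUtf8Demo {
    // Same three branches as writeChars, collecting the bytes in memory.
    static byte[] encodeChars(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            int code = s.charAt(i);
            if (code >= 0x01 && code <= 0x7F) {
                out.write(code);                          // 1 byte
            } else if ((code >= 0x80 && code <= 0x7FF) || code == 0) {
                out.write(0xC0 | (code >> 6));            // 2 bytes
                out.write(0x80 | (code & 0x3F));
            } else {
                out.write(0xE0 | (code >>> 12));          // 3 bytes
                out.write(0x80 | ((code >> 6) & 0x3F));
                out.write(0x80 | (code & 0x3F));
            }
        }
        return out.toByteArray();
    }

    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encodeChars("a")));      // 61
        System.out.println(hex(encodeChars("\u00e9"))); // c3 a9
        System.out.println(hex(encodeChars("\u4e2d"))); // e4 b8 ad
        System.out.println(hex(encodeChars("\0")));     // c0 80
    }
}
```

Since the loop works on Java chars (UTF-16 units), a character outside the Basic Multilingual Plane is seen as two surrogate chars and comes out as six bytes, three per surrogate. The length written by writeString is likewise a char count, not a byte count.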