Elasticsearch: selecting the correct index creation settings
Friends!
Maybe someone has come across this and can point me in the right direction.
Task: there is a table in a MySQL database with 50,000 rows of nomenclature. The table contains a single name field. I load the rows into the index using Logstash.
I need to run a relevance search that finds strings similar to the entered query.
Here is my index now:
{
  "settings": {
    "index": {
      "max_ngram_diff": 20,
      "similarity": {
        "default": {
          "type": "BM25",
          "b": 0.75,
          "k1": 0
        }
      }
    },
    "analysis": {
      "tokenizer": {
        "custom_tokenizer": {
          "type": "pattern",
          "pattern": "[\\s,]+"
        }
      },
      "char_filter": {
        "comma_replacer": {
          "type": "pattern_replace",
          "pattern": ",",
          "replacement": " "
        }
      },
      "filter": {
        "custom_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 6
        },
        "custom_pattern_capture": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": ["(^|\\s)([\\p{L}\\p{N}]{1})(\\s|$)"]
        }
      },
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "custom_tokenizer",
          "char_filter": ["comma_replacer"],
          "filter": ["custom_pattern_capture", "custom_ngram"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "custom_analyzer"
      }
    }
  }
}
But it doesn't do what I need.
I need to create an index in Elasticsearch so that the phrase “Truba 32412 d50 L 3 1.5” is split into the following tokens (listed word by word):
Truba: "Tr", "Tru", "Trub", "Truba", "ru", "rub", "ruba", "ub", "uba", "ba"
32412: "32","324","3241","32412","24","241","2412","41","412","12"
d50: "d","50"
L: "L"
3: "3"
1.5: "1.5"
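For reference, whatever analyzer ends up being used, the actual token output can be compared against the list above with Elasticsearch's `_analyze` API (the index name `nomenclature` here is just a placeholder for mine):

```json
POST /nomenclature/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "Truba 32412 d50 L 3 1.5"
}
```

The response lists every token the analyzer emits, together with its position and offsets, which makes it easy to see where the current configuration diverges from the expected split.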
Step by step:
- The phrase is split into parts on spaces and punctuation marks.
- All of these parts are kept as separate tokens.
- Each part is then split further into tokens depending on which characters it consists of:
3.1 If a token consists only of letters and its length is > 1, an ngram from 2 to 6 is applied to it.
3.2 If a token consists only of digits and its length is > 1, an ngram from 2 to 6 is applied to it.
3.3 If the token is a real number, it is kept whole.
3.4 If a token consists of both letters and digits, it is split, for example DN500 -> "DN" and "500", and rules 2 and 3.1-3.4 are applied to each resulting token in turn.
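One direction that seems to map onto these rules, which I have not managed to verify: `word_delimiter_graph` does the letter/digit split of rule 3.4, a `type_table` entry mapping "." to DIGIT keeps real numbers like "1.5" whole (rule 3.3), and a `condition` token filter restricts the ngram filter to tokens that are longer than one character and contain no decimal point (rules 3.1 and 3.2). Both filters are real Elasticsearch features, but the filter names, the index name `nomenclature`, and the exact Painless predicate below are my assumptions and would need checking:

```json
PUT /nomenclature
{
  "settings": {
    "index": { "max_ngram_diff": 20 },
    "analysis": {
      "filter": {
        "split_alnum": {
          "type": "word_delimiter_graph",
          "split_on_numerics": true,
          "type_table": [". => DIGIT"]
        },
        "custom_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 6
        },
        "ngram_if_plain": {
          "type": "condition",
          "filter": ["custom_ngram"],
          "script": {
            "source": "token.term.length() > 1 && !token.term.toString().contains(\".\")"
          }
        }
      },
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["split_alnum", "ngram_if_plain"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "custom_analyzer" }
    }
  }
}
```

The `condition` wrapper matters because the ngram token filter drops tokens shorter than `min_gram`, which is presumably why single-character tokens like "L", "3" and "d" disappear in my current setup; wrapped in the condition, they bypass the ngram filter and survive unchanged.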
I think such an algorithm would make my search as accurate as possible, but I don't know how to correctly write an index-creation request with this behavior. I would really appreciate your help!