Elasticsearch: choosing the correct index creation settings

Friends!

Maybe someone has come across this and can point me in the right direction.

Task: there is a table in a MySQL database with 50,000 rows of nomenclature items. The table contains a single name field. I load the rows into the index using Logstash.

I need a relevance-ranked search that finds strings similar to the entered query.

Here is my index now:


{
  "settings": {
    "index": {
      "max_ngram_diff": 20,
      "similarity": {
        "default": {
          "type": "BM25",
          "b": 0.75,
          "k1": 0
        }
      }
    },
    "analysis": {
      "tokenizer": {
        "custom_tokenizer": {
          "type": "pattern",
          "pattern": "[\\s,]+"
        }
      },
      "char_filter": {
        "comma_replacer": {
          "type": "pattern_replace",
          "pattern": ",",
          "replacement": " "
        }
      },
      "filter": {
        "custom_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 6
        },
        "custom_pattern_capture": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": ["(^|\\s)([\\p{L}\\p{N}]{1})(\\s|$)"]
        }
      },
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "custom_tokenizer",
          "char_filter": ["comma_replacer"],
          "filter": ["custom_pattern_capture", "custom_ngram"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "custom_analyzer"
      }
    }
  }
}

But it doesn't do what I need.
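For example, if I simulate the ngram filter (min_gram 2, max_gram 6) in Python — this is just my reading of the filter's behavior, not Elasticsearch itself — single-character tokens such as “L” produce no output at all, and “1.5” is chopped into fragments rather than kept whole:

```python
def ngram_filter(token, min_gram=2, max_gram=6):
    # Rough simulation of an ngram token filter: emit every character
    # n-gram of the token with length in [min_gram, max_gram].
    # A token shorter than min_gram therefore emits nothing at all.
    return [token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)]

print(ngram_filter("L"))    # -> [] : the single-letter token is lost
print(ngram_filter("1.5"))  # -> ['1.', '.5', '1.5'] : fragments, not just "1.5"
```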

I want to create the index in Elasticsearch so that the phrase “Truba 32412 d50 L 3 1.5” is parsed into the following tokens (listed word by word):

Truba: "Tr", "Tru", "Trub", "Truba", "ru", "rub", "ruba", "ub", "uba", "ba"

32412: "32","324","3241","32412","24","241","2412","41","412","12"

d50: "d","50"

L: "L"

3: "3"

1.5 : "1.5"

Step by step:

  1. The phrase is split into parts on spaces and punctuation marks.
  2. All of these parts are saved as separate tokens.
  3. Each part is then split further into tokens depending on which characters it consists of:

3.1 If a token consists only of letters and its length is > 1, an ngram from 2 to 6 is applied to it.

3.2 If a token consists only of digits and its length is > 1, an ngram from 2 to 6 is applied to it.

3.3 If the token is a real number, it is kept whole.

3.4 If a token consists of both letters and digits, it is split, for example DN500 -> “DN” and “500”, and rules 2, 3.1, 3.2, 3.3, 3.4 are applied to each of the resulting tokens in turn.
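The steps above can be sketched in plain Python (just an illustration of the algorithm I want, not an Elasticsearch analyzer — the regexes for “letters only”, “digits only” and “real number” are my own guesses at the rules):

```python
import re

def ngrams(token, lo=2, hi=6):
    # All character n-grams of the token with length in [lo, hi] (rules 3.1 / 3.2).
    return [token[i:i + n]
            for n in range(lo, hi + 1)
            for i in range(len(token) - n + 1)]

def expand(part):
    if re.fullmatch(r"\d+\.\d+", part):            # 3.3: real number -> keep whole
        return [part]
    if re.fullmatch(r"[^\W\d_]+|\d+", part):       # letters only or digits only
        return ngrams(part) if len(part) > 1 else [part]   # 3.1 / 3.2
    # 3.4: mixed letters and digits -> split into letter/number runs, recurse
    out = []
    for run in re.findall(r"[^\W\d_]+|\d+(?:\.\d+)?", part):
        out.extend(expand(run))
    return out

def tokenize(phrase):
    # Steps 1-2: split on whitespace and commas, then expand each part.
    tokens = []
    for part in re.split(r"[\s,]+", phrase.strip()):
        tokens.extend(expand(part))
    return tokens

print(tokenize("Truba 32412 d50 L 3 1.5"))
```

For “Truba 32412 d50 L 3 1.5” this yields exactly the 25 tokens listed above.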

I think such an algorithm will make my search as accurate as possible, but I don’t know how to correctly write an index-creation request with this functionality. I would really appreciate any help!

