I have an index with a domain field that stores, for example:
domain: "google.com"
What I would like to do is tell ES: "Ignore the TLD, and run a fuzzy match on the remaining part". So if someone searches for "gogle.net", it should ignore the ".net" and the ".com", and therefore still match the document with "google.com".
I can remove the TLD from the input string if required, but the domain is stored together with its TLD. How do I define an analyzer for that?
Tl;dr
If you never want to match on the top-level domain, you might want to remove it, or store it in a separate field. That said, there are solutions that work without removing it.
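For completeness, stripping the TLD client-side before indexing could look like the sketch below (illustration only; the domain_name and tld field names are hypothetical, and the split naively assumes a single-label TLD):

```python
def split_domain(domain):
    # Split "google.com" into the registrable name and its TLD.
    # rpartition splits on the LAST dot, so "maps.google.com" keeps
    # "maps.google" as the name part.
    name, _, tld = domain.rpartition(".")
    if not name:  # no dot at all: treat the whole string as the name
        return {"domain_name": tld, "tld": ""}
    return {"domain_name": name, "tld": tld}

print(split_domain("google.com"))  # {'domain_name': 'google', 'tld': 'com'}
```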
Solutions
Both solutions below rely on a custom analyzer defined in your index.
Using ngram
With the ngram tokenizer, documents can match on the tokens they share. Below is a request that returns the tokens created by the n-gram tokenizer:
POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 3,
    "token_chars": [
      "letter",
      "digit"
    ]
  },
  "text": "google.com"
}
In this situation we would get:
google.com => goo, oog, ogl, gle, com
gogle.net  => gog, ogl, gle, net
Meaning the two have ogl and gle in common, and each of those matches adds to the _score.
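To make the overlap concrete, here is a small Python sketch (not Elasticsearch code, just an approximation of what the tokenizer emits with min_gram = max_gram = 3 and token_chars limited to letters and digits):

```python
import re

def ngrams(text, n=3):
    """Approximate the ngram tokenizer: split on anything that is not
    a letter or digit, then emit every n-character window of each chunk."""
    grams = []
    for chunk in re.findall(r"[a-z0-9]+", text.lower()):
        grams.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return grams

indexed = ngrams("google.com")   # ['goo', 'oog', 'ogl', 'gle', 'com']
searched = ngrams("gogle.net")   # ['gog', 'ogl', 'gle', 'net']
print(sorted(set(indexed) & set(searched)))  # ['gle', 'ogl']
```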
Here is a demo
PUT 79544437
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram33": {
          "type": "custom",
          "tokenizer": "ngram-tokenizer",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "tokenizer": {
        "ngram-tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": {
        "type": "text",
        "analyzer": "ngram33"
      }
    }
  }
}

PUT 79544437/_doc/1
{
  "domain": "google.com"
}

GET 79544437/_search
{
  "query": {
    "match": {
      "domain": {
        "query": "gogle"
      }
    }
  }
}
Custom Tokenizer
Using a custom tokenizer, you can create one token per sub-domain, domain, and top-level domain.
The following analyzer showcases this:
POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern_split",
    "pattern": "\\."
  },
  "text": "google.com"
}
You get two tokens: google and com.
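Outside Elasticsearch, the same behavior can be sketched in Python (an approximation of simple_pattern_split on "\\." combined with the lowercase filter):

```python
import re

def split_tokens(text):
    # Mimic the simple_pattern_split tokenizer on a literal dot,
    # plus the lowercase token filter; drop empty tokens.
    return [token for token in re.split(r"\.", text.lower()) if token]

print(split_tokens("google.com"))  # ['google', 'com']
```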
The index and search you are after would look like this:
PUT 79544437
{
  "settings": {
    "analysis": {
      "analyzer": {
        "split": {
          "type": "custom",
          "tokenizer": "split-tokenizer",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "tokenizer": {
        "split-tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "\\."
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": {
        "type": "text",
        "analyzer": "split"
      }
    }
  }
}

PUT 79544437/_doc/1
{
  "domain": "google.com"
}

GET 79544437/_search
{
  "query": {
    "match": {
      "domain": {
        "query": "gogle",
        "fuzziness": "auto"
      }
    }
  }
}
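The "fuzziness": "auto" parameter is what lets gogle match the google token produced by the tokenizer: with auto, Elasticsearch allows an edit distance of 1 for query terms of 3-5 characters and 2 for longer terms. A minimal sketch of that edit-distance (Levenshtein) computation:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance
    # (insertions, deletions, substitutions), row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# "gogle" is 5 characters, so AUTO allows 1 edit; one insertion yields "google"
print(levenshtein("gogle", "google"))  # 1
```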
Tags: elasticsearch, Fuzzy matching domain while ignoring TLD, Stack Overflow