I have an index with a domain field that stores, for example:
domain: "google.com"
What I would like to do is tell ES: "Ignore the TLD, and run a fuzzy match on the remaining part". So if someone searches for "gogle.net", it should ignore the ".net" and the ".com", and therefore still match the document with "google.com".
I can remove the TLD from the input string if required, but the domain is stored together with its TLD. How do I define an analyzer for that?
Tl;dr
If you never want to match on the top-level domain, you might want to remove it, or store it in a separate field. That said, there are solutions that work without removing it.
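For completeness, stripping the TLD client-side before indexing could look like the sketch below (illustration only; the domain_name and tld field names are hypothetical, and the split naively assumes a single-label TLD):

```python
def split_domain(domain):
    # Split "google.com" into the registrable name and its TLD.
    # rpartition splits on the LAST dot, so "maps.google.com" keeps
    # "maps.google" as the name part.
    name, _, tld = domain.rpartition(".")
    if not name:  # no dot at all: treat the whole string as the name
        return {"domain_name": tld, "tld": ""}
    return {"domain_name": name, "tld": tld}

print(split_domain("google.com"))  # {'domain_name': 'google', 'tld': 'com'}
```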
Solutions
Both solutions below rely on a custom analyzer defined in your index.
Using ngram
With the ngram tokenizer, documents can match on the tokens they share. Below is a request that returns the tokens created by the n-gram tokenizer:
POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 3,
    "max_gram": 3,
    "token_chars": [
      "letter",
      "digit"
    ]
  },
  "text": "google.com"
}
In this situation we would get:
google.com => goo, oog, ogl, gle, com
gogle.net  => gog, ogl, gle, net
Meaning the two have ogl and gle in common, and each of those matches adds to the _score.
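To make the overlap concrete, here is a small Python sketch (not Elasticsearch code, just an approximation of what the tokenizer emits with min_gram = max_gram = 3 and token_chars limited to letters and digits):

```python
import re

def ngrams(text, n=3):
    """Approximate the ngram tokenizer: split on anything that is not
    a letter or digit, then emit every n-character window of each chunk."""
    grams = []
    for chunk in re.findall(r"[a-z0-9]+", text.lower()):
        grams.extend(chunk[i:i + n] for i in range(len(chunk) - n + 1))
    return grams

indexed = ngrams("google.com")   # ['goo', 'oog', 'ogl', 'gle', 'com']
searched = ngrams("gogle.net")   # ['gog', 'ogl', 'gle', 'net']
print(sorted(set(indexed) & set(searched)))  # ['gle', 'ogl']
```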
Here is a demo
PUT 79544437
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram33": {
          "type": "custom",
          "tokenizer": "ngram-tokenizer",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "tokenizer": {
        "ngram-tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": {
        "type": "text",
        "analyzer": "ngram33"
      }
    }
  }
}

PUT 79544437/_doc/1
{
  "domain": "google.com"
}

GET 79544437/_search
{
  "query": {
    "match": {
      "domain": {
        "query": "gogle"
      }
    }
  }
}
Custom Tokenizer
Using a custom tokenizer, you can create one token per sub-domain, domain, and top-level domain.
The following analyzer showcases this:
POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern_split",
    "pattern": "\\."
  },
  "text": "google.com"
}
You get two tokens: google and com.
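Outside Elasticsearch, the same behavior can be sketched in Python (an approximation of simple_pattern_split on "\\." combined with the lowercase filter):

```python
import re

def split_tokens(text):
    # Mimic the simple_pattern_split tokenizer on a literal dot,
    # plus the lowercase token filter; drop empty tokens.
    return [token for token in re.split(r"\.", text.lower()) if token]

print(split_tokens("google.com"))  # ['google', 'com']
```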
The index and search you are after would look like this:
PUT 79544437
{
  "settings": {
    "analysis": {
      "analyzer": {
        "split": {
          "type": "custom",
          "tokenizer": "split-tokenizer",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "tokenizer": {
        "split-tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "\\."
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "domain": {
        "type": "text",
        "analyzer": "split"
      }
    }
  }
}

PUT 79544437/_doc/1
{
  "domain": "google.com"
}

GET 79544437/_search
{
  "query": {
    "match": {
      "domain": {
        "query": "gogle",
        "fuzziness": "auto"
      }
    }
  }
}
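The "fuzziness": "auto" parameter is what lets gogle match the google token produced by the tokenizer: with auto, Elasticsearch allows an edit distance of 1 for query terms of 3-5 characters and 2 for longer terms. A minimal sketch of that edit-distance (Levenshtein) computation:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance
    # (insertions, deletions, substitutions), row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# "gogle" is 5 characters, so AUTO allows 1 edit; one insertion yields "google"
print(levenshtein("gogle", "google"))  # 1
```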
Tags: elasticsearch, Fuzzy matching domain while ignoring TLD, Stack Overflow