I have this analyzer and it's working as expected, but I wanted to tune it a little. I originally built it so that dashed terms don't get split into 2 tokens. Now it would also be nice if it emitted 3 forms: one with the dash (123-456, as it does right now), one concatenated without the dash (123456), and one split into 2 tokens (123 and 456). I tried messing around with other analyzers but none seem to make it work. Does anyone have any ideas on how to approach this?

{
  scoringProfiles: [
    {
      name: "product_search",
      textWeights: {
        weights: {
          "title": 5
        }
      }
    }
  ],
  charFilters: [
    {
      odatatype: "#Microsoft.Azure.Search.MappingCharFilter",
      name: "dash",
      mappings: ["-=>"]
    }
  ],
  analyzers: [
    {
      odatatype: "#Microsoft.Azure.Search.CustomAnalyzer",
      name: "dash-removal",
      tokenizerName: "whitespace",
      tokenFilters: ["lowercase"]
    }
  ]
}


asked Mar 26 at 22:07 by Bonhart

  • Use a mapping_char_filter to replace dashes with an empty string to remove them (123456). – Suresh Chikkam, Mar 27 at 4:47

1 Answer

To achieve the desired tokenization behavior in Azure Cognitive Search, you can create a custom analyzer that emits each value in three forms:

  • Original token with dashes (e.g., 123-456)

  • Concatenated token without dashes (e.g., 123456)

  • Split tokens at the dash (e.g., 123 and 456)

The keyword_v2 tokenizer treats the entire input as a single token, so a value like 123-456 reaches the token filters intact. A word delimiter token filter then produces the remaining forms: it splits the token at delimiter characters such as dashes (123 and 456), concatenates the parts when catenateAll is enabled (123456), and keeps the original dashed token when preserveOriginal is enabled. Note that a mapping character filter that strips dashes would be counterproductive here: character filters run before the tokenizer, so the dash would already be gone and the original 123-456 form could never be preserved. For more details, see the WordDelimiterTokenFilter Class documentation.

Sample Analyzer Configuration:

{
  "tokenizers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.KeywordTokenizerV2",
      "name": "keyword_v2"
    }
  ],
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.WordDelimiterTokenFilter",
      "name": "word_delimiter",
      "generateWordParts": true,
      "generateNumberParts": true,
      "catenateWords": true,
      "catenateNumbers": true,
      "catenateAll": true,
      "splitOnCaseChange": true,
      "splitOnNumerics": true,
      "preserveOriginal": true
    }
  ],
  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "custom_dash_analyzer",
      "tokenizer": "keyword_v2",
      "tokenFilters": ["lowercase", "word_delimiter"]
    }
  ]
}

With this configuration, an input such as 123-456 produces the tokens 123-456, 123456, 123, and 456, covering all three search scenarios.
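As a sanity check on that expected output, here is a rough Python simulation of what the word delimiter filter emits for a single token when preserveOriginal, part generation, and catenateAll are enabled. This is illustrative only; the real Lucene filter also handles case changes, letter/digit transitions, position increments, and more:

```python
import re

def word_delimiter(token,
                   generate_parts=True,
                   catenate_all=True,
                   preserve_original=True):
    """Rough simulation of a word delimiter filter for one token.

    Illustrative only -- not the actual Lucene implementation.
    """
    tokens = []
    if preserve_original:
        tokens.append(token)              # keep "123-456" as-is
    # Split on any run of non-alphanumeric delimiter characters.
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    if generate_parts and len(parts) > 1:
        tokens.extend(parts)              # "123", "456"
    if catenate_all and len(parts) > 1:
        tokens.append("".join(parts))     # "123456"
    return tokens

print(word_delimiter("123-456"))
# -> ['123-456', '123', '456', '123456']
```

Running this on 123-456 shows all four forms the analyzer is expected to index.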

For further information on token filters and their configurations, consult the TokenFilterName Struct documentation.
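To confirm what the analyzer actually emits, you can call the service's Analyze Text REST API against your index. A sketch of the request follows; the index name products, the API version, and the placeholders in brackets are assumptions you would replace with your own values:

POST https://[service-name].search.windows.net/indexes/products/analyze?api-version=2023-11-01
Content-Type: application/json
api-key: [admin-key]

{
  "analyzer": "custom_dash_analyzer",
  "text": "123-456"
}

The response lists each token produced, along with its offsets and position, which makes it easy to verify that all forms of 123-456 are present.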

Tags: azure