java - Can not remove stop words using StandardAnalyzer from Apache Lucene - Stack Overflow

I use the code below to remove stop words from a string, but it is not working:

package com.example;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Main {
    public static void main(String[] args) throws IOException {
        String text = "The quick brown fox jumps over the lazy dog";

        Analyzer analyzer = new StandardAnalyzer();
        TokenStream tokenStream = analyzer.tokenStream("field", text);
        CharTermAttribute charTermAttr = tokenStream.addAttribute(CharTermAttribute.class);

        tokenStream.reset();
        List<String> tokens = new ArrayList<>();
        while (tokenStream.incrementToken()) {
            tokens.add(charTermAttr.toString());
        }
        tokenStream.end();

        System.out.println("Tokens: " + tokens);
    }
}

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>demo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>21</maven.compiler.source>
        <maven.compiler.target>21</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>10.0.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>10.0.0</version>
        </dependency>


        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analysis-common</artifactId>
            <version>10.0.0</version>
        </dependency>
    </dependencies>
</project>

Expected result: Tokens: [quick, brown, fox, jumps, lazy, dog]

Actual result: Tokens: [the, quick, brown, fox, jumps, over, the, lazy, dog]

As you can see, the Lucene version is 10.0.0 (currently the latest version), and the Java version is 21 (the current LTS).

The documentation at https://lucene.apache.org/core/10_0_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html says of the default StandardAnalyzer constructor: "Builds an analyzer with no stop words." But it does not work that way when I use it. Does anyone know what is happening?

I tried reading examples on the Internet and the Apache Lucene 10.0.0 docs.

asked Nov 21, 2024 at 9:17 by tle130475c; edited Nov 21, 2024 at 9:53 by Olaf Kock

1 Answer

The StandardAnalyzer constructor you are using is this:

Analyzer analyzer = new StandardAnalyzer();

As you note in your question, this constructor "builds an analyzer with no stop words".

That means the analyzer does not have a list of stopwords - and therefore does not have any information about what stopwords you want to remove/ignore when you build your index.

(It doesn't mean "there will be no stopwords in your index" - it actually means the opposite: There will be no stopwords removed from your index.)

You can use one of the other constructors, which allow you to provide that missing list.

For example, StandardAnalyzer(CharArraySet stopWords)

A simple example:

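import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

...

// Initial capacity of 2; "true" makes stop word matching case-insensitive.
CharArraySet stopWords = new CharArraySet(2, true);
stopWords.add("foo");
stopWords.add("bar");

// Tokens that match an entry in stopWords are dropped during analysis.
Analyzer analyzer = new StandardAnalyzer(stopWords);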
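
To tie this back to the question, here is a minimal sketch that runs the question's sentence through a StandardAnalyzer built with a custom stop set. The class name CustomStopWordsDemo, the helper analyze, and the stop list ("the", "over") are assumptions for illustration, chosen only to reproduce the expected output from the question.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

class CustomStopWordsDemo {

    // Runs the text through the analyzer and collects the emitted terms.
    static List<String> analyze(Analyzer analyzer, String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stop list, picked to match the question's expected result.
        CharArraySet stopWords = new CharArraySet(List.of("the", "over"), true);
        try (Analyzer analyzer = new StandardAnalyzer(stopWords)) {
            List<String> tokens = analyze(analyzer, "The quick brown fox jumps over the lazy dog");
            // Prints: Tokens: [quick, brown, fox, jumps, lazy, dog]
            System.out.println("Tokens: " + tokens);
        }
    }
}
```

The token stream and the analyzer are both closed with try-with-resources here, which the code in the question leaves out.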

Or you can use StandardAnalyzer(Reader stopwords). In this case you can provide the stopwords in a file (for example). The file should be a simple text file, with one stopword on each line.
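
A rough sketch of the Reader-based constructor follows. It uses a StringReader as a stand-in for a file so that it runs on its own. The class name ReaderStopWordsDemo, the helper analyze, and the stop words are assumptions for illustration. It is probably safest to keep the entries lowercase, since the analyzer lowercases tokens before stop words are filtered.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

class ReaderStopWordsDemo {

    // Builds a StandardAnalyzer from a one-word-per-line stop list
    // and returns the terms that survive analysis.
    static List<String> analyze(Reader stopWordSource, String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (Analyzer analyzer = new StandardAnalyzer(stopWordSource);
             TokenStream stream = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a file. With a real file, something like
        // Files.newBufferedReader(Path.of("stopwords.txt")) could be passed
        // instead (the file name is hypothetical).
        Reader stopWordSource = new StringReader("the\nover\n");
        System.out.println("Tokens: "
                + analyze(stopWordSource, "The quick brown fox jumps over the lazy dog"));
    }
}
```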

There is a list of stopwords built into Lucene, but they are used directly by the EnglishAnalyzer, not the StandardAnalyzer.

So, you could use that analyzer if you wanted to.
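
If you want the built-in English list but not the rest of what EnglishAnalyzer does (it also applies stemming, so its tokens may differ from the original words), one option is to pass EnglishAnalyzer.ENGLISH_STOP_WORDS_SET to StandardAnalyzer. The sketch below assumes lucene-analysis-common is on the classpath, as in the question's pom.xml. The class and method names are made up for illustration. Note that "over" is not in Lucene's English list, so it is kept.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

class EnglishStopWordsDemo {

    // Runs the text through the analyzer and collects the emitted terms.
    static List<String> analyze(Analyzer analyzer, String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        String text = "The quick brown fox jumps over the lazy dog";

        // StandardAnalyzer with Lucene's English stop list: no stemming.
        try (Analyzer standard = new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET)) {
            // Prints: [quick, brown, fox, jumps, over, lazy, dog]
            System.out.println(analyze(standard, text));
        }

        // EnglishAnalyzer: same stop list, but terms are also stemmed.
        try (Analyzer english = new EnglishAnalyzer()) {
            System.out.println(analyze(english, text));
        }
    }
}
```

To also drop "over", as in the question's expected output, a custom stop set would still be needed.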

For reference, this was a change to Lucene that happened back in version 8: Move ENGLISH_STOP_WORD_SET from StandardAnalyzer to EnglishAnalyzer.

That makes sense, since the stopwords list in Lucene is in English.

Older code samples of Lucene may use StandardAnalyzer() and automatically remove stopwords, for this reason. Maybe that is what you have seen somewhere.

The list of English stopwords used by Lucene can be seen in the Lucene source code.

For reference:

"a", "an", "and", "are", "as", "at", "be", 
"but", "by", "for", "if", "in", "into", "is",
"it", "no", "not", "of", "on", "or", "such", 
"that", "the", "their", "then", "there",
"these", "they", "this", "to", "was", "will", "with"
