I am using the code below to remove stop words from a string, but it is not working:
package com.example;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Main {
    public static void main(String[] args) throws IOException {
        String text = "The quick brown fox jumps over the lazy dog";
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream tokenStream = analyzer.tokenStream("field", text);
        CharTermAttribute charTermAttr = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        List<String> tokens = new ArrayList<>();
        while (tokenStream.incrementToken()) {
            tokens.add(charTermAttr.toString());
        }
        tokenStream.end();
        System.out.println("Tokens: " + tokens);
    }
}
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>demo</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <maven.compiler.source>21</maven.compiler.source>
        <maven.compiler.target>21</maven.compiler.target>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>10.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>10.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analysis-common</artifactId>
            <version>10.0.0</version>
        </dependency>
    </dependencies>
</project>
Expected result:
Tokens: [quick, brown, fox, jumps, lazy, dog]
Actual result:
Tokens: [the, quick, brown, fox, jumps, over, the, lazy, dog]
As you can see, the Lucene version is 10.0.0 (the latest version at the time of writing) and the Java version is 21 (the current LTS).
According to the documentation at https://lucene.apache.org/core/10_0_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html, the default constructor of StandardAnalyzer "Builds an analyzer with no stop words.", but it does not behave the way I expected. Does anyone know what is happening?
I tried reading examples on the Internet and the Apache Lucene 10.0.0 documentation.
Asked Nov 21, 2024 at 9:17 by tle130475c; edited Nov 21, 2024 at 9:53 by Olaf Kock.

1 Answer
The StandardAnalyzer constructor you are using is this:
Analyzer analyzer = new StandardAnalyzer();
As you note in your question, this constructor "builds an analyzer with no stop words".
That means the analyzer does not have a list of stopwords - and therefore does not have any information about what stopwords you want to remove/ignore when you build your index.
(It doesn't mean "there will be no stopwords in your index" - it actually means the opposite: There will be no stopwords removed from your index.)
You can use one of the other constructors, which allow you to provide that missing list.
For example, StandardAnalyzer(CharArraySet stopWords)
A simple example:
import org.apache.lucene.analysis.CharArraySet;
...
CharArraySet stopWords = new CharArraySet(2, true);
stopWords.add("foo");
stopWords.add("bar");
Analyzer analyzer = new StandardAnalyzer(stopWords);
Or you can use StandardAnalyzer(Reader reader). In this case you can provide the stopwords in a file (for example). The file will be a simple text file, with one stopword on each line.
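As a rough sketch of the Reader-based constructor, the example below feeds the stopwords through a StringReader that stands in for a file. The class name ReaderStopWordsDemo and the helpers tokenize and run are made up for illustration, and the two stopwords are chosen only to match the expected output in the question. The words are written in lowercase on the assumption that StandardAnalyzer lowercases tokens before it filters them.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ReaderStopWordsDemo {

    // Hypothetical helper: runs the text through the analyzer and collects the terms.
    static List<String> tokenize(Analyzer analyzer, String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    // Hypothetical helper: builds the analyzer from a Reader and tokenizes the sample text.
    static List<String> run() throws IOException {
        // Stands in for a stopwords file: one lowercase word per line.
        Reader stopWords = new StringReader("the\nover\n");
        try (Analyzer analyzer = new StandardAnalyzer(stopWords)) {
            return tokenize(analyzer, "The quick brown fox jumps over the lazy dog");
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Tokens: " + run());
    }
}
```

For a real file, a reader such as Files.newBufferedReader(Path.of("stopwords.txt")) could replace the StringReader; the file name here is only a placeholder.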
There is a list of stopwords built into Lucene, but it is used directly by the EnglishAnalyzer, not the StandardAnalyzer.
So, you could use that analyzer if you wanted to.
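If you go that route, a minimal sketch could look like the one below (EnglishAnalyzerDemo and tokenize are illustrative names, not part of Lucene). One caveat to keep in mind: EnglishAnalyzer does more than remove stopwords. It also stems tokens, so the terms it emits may not be the original words; a plural such as "jumps" would be expected to come back in a reduced form.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class EnglishAnalyzerDemo {

    // Hypothetical helper: tokenizes the text with EnglishAnalyzer's default stop set.
    static List<String> tokenize(String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        // Resources are closed in reverse order: the stream first, then the analyzer.
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream stream = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // "the" is dropped as a stopword; the remaining terms are stemmed.
        System.out.println("Tokens: " + tokenize("The quick brown fox jumps over the lazy dog"));
    }
}
```

If you need the original word forms, passing a stopword set to StandardAnalyzer is probably the better fit.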
For reference, this was a change to Lucene that happened back in version 8: Move ENGLISH_STOP_WORD_SET from StandardAnalyzer to EnglishAnalyzer.
That makes sense, since the stopwords list in Lucene is in English.
Older Lucene code samples may use StandardAnalyzer() and have stopwords removed automatically, for this reason. Maybe that is what you have seen somewhere.
The list of English stopwords used by Lucene can be seen in the source code of EnglishAnalyzer (the ENGLISH_STOP_WORDS_SET constant).
For reference:
"a", "an", "and", "are", "as", "at", "be",
"but", "by", "for", "if", "in", "into", "is",
"it", "no", "not", "of", "on", "or", "such",
"that", "the", "their", "then", "there",
"these", "they", "this", "to", "was", "will", "with"
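Note that "over" is not in this list, so the built-in set alone would not produce the exact output expected in the question. The sketch below passes EnglishAnalyzer.ENGLISH_STOP_WORDS_SET to StandardAnalyzer, first as it is and then as a copy extended with "over". The class name BuiltInStopWordsDemo and the helper tokenize are made up for illustration. The copy is there because the built-in set is read-only.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class BuiltInStopWordsDemo {

    // Hypothetical helper: tokenizes the text with a StandardAnalyzer using the given stop set.
    static List<String> tokenize(CharArraySet stopWords, String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (Analyzer analyzer = new StandardAnalyzer(stopWords);
             TokenStream stream = analyzer.tokenStream("field", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        String text = "The quick brown fox jumps over the lazy dog";

        // Built-in English list as is: "the" is removed, "over" stays.
        System.out.println(tokenize(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET, text));
        // prints [quick, brown, fox, jumps, over, lazy, dog]

        // Copy the built-in set so that extra words can be added to it.
        CharArraySet extended = new CharArraySet(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET, true);
        extended.add("over");
        System.out.println(tokenize(extended, text));
        // prints [quick, brown, fox, jumps, lazy, dog]
    }
}
```

Unlike EnglishAnalyzer, this keeps the tokens unstemmed, which matches the word forms shown in the question.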