Trying to import RegexTextSplitter using

from langchain.text_splitter import RegexTextSplitter ,RecursiveCharacterTextSplitter

And I get the error

    from langchain.text_splitter import RegexTextSplitter ,RecursiveCharacterTextSplitter
ImportError: cannot import name 'RegexTextSplitter' from 'langchain.text_splitter'

Doing

from langchain_text_splitters import RegexTextSplitter, RecursiveCharacterTextSplitter

as suggested here does not work either.

    from langchain_text_splitters import RegexTextSplitter, RecursiveCharacterTextSplitter
ImportError: cannot import name 'RegexTextSplitter' from 'langchain_text_splitters'

Is RegexTextSplitter just not present in the latest version of langchain? Then why is this piece of documentation available?

from langchain.text_splitter import RegexTextSplitter

Running dir() gives

['CharacterTextSplitter', 'ElementType', 'ExperimentalMarkdownSyntaxTextSplitter', 'HTMLHeaderTextSplitter', 'HTMLSectionSplitter', 'HTMLSemanticPreservingSplitter', 'HeaderType', 'KonlpyTextSplitter', 'Language', 'LatexTextSplitter', 'LineType', 'MarkdownHeaderTextSplitter', 'MarkdownTextSplitter', 'NLTKTextSplitter', 'PythonCodeTextSplitter', 'RecursiveCharacterTextSplitter', 'RecursiveJsonSplitter', 'SentenceTransformersTokenTextSplitter', 'SpacyTextSplitter', 'TextSplitter', 'TokenTextSplitter', 'Tokenizer', '__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'split_text_on_tokens']

Which of these is the best choice for splitting a document by regex? Thanks.


Asked Mar 12 at 13:13 by Dev_A

1 Answer

RegexTextSplitter is not available in current releases: as your dir() output shows, neither langchain.text_splitter nor langchain_text_splitters exports it. Use RecursiveCharacterTextSplitter instead. It treats its separators as regular expressions when you pass is_separator_regex=True, so one class covers both literal and regex splitting. You can use it like this:

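# Import from the standalone splitters package:
from langchain_text_splitters import RecursiveCharacterTextSplitter
# On older installs the same class is also importable from langchain itself:
# from langchain.text_splitter import RecursiveCharacterTextSplitter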

# Define your regex separators
separators = [r'\n\n', r'\n', r'(?<=[.?!])\s+']

# Initialize the RecursiveCharacterTextSplitter with regex separators
text_splitter = RecursiveCharacterTextSplitter(
    separators=separators,
    is_separator_regex=True,
    chunk_size=1000,  # Max characters per chunk (the sample below is shorter, so lower this to see several chunks)
    chunk_overlap=0   # Number of characters shared between consecutive chunks
)

# Example text to split
text = """
Mr. Smith bought cheapsite for 1.5 million dollars, i.e., he paid a lot for it.
Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true...
Well, with a probability of .9 it isn't.
"""

# Split the text
chunks = text_splitter.split_text(text)

# Output the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")
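
To see why chunk_size matters as much as the separators, here is a minimal standard-library sketch of the general "split on a regex, then pack pieces up to a size limit" idea. It is a simplified illustration, not LangChain's actual implementation; the helper name split_and_pack and the sample text are made up for this example.

```python
import re

def split_and_pack(text, pattern, chunk_size):
    """Split on a regex, then greedily pack the pieces into chunks
    of at most chunk_size characters (hypothetical helper)."""
    # Drop empty or whitespace-only pieces produced by the split
    pieces = [p for p in re.split(pattern, text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow: close the current chunk
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "First sentence. Second one! Third?"
sentence_end = r"(?<=[.?!])\s+"  # whitespace that follows ., ? or !

small = split_and_pack(text, sentence_end, chunk_size=20)
large = split_and_pack(text, sentence_end, chunk_size=1000)

print(small)  # ['First sentence.', 'Second one! Third?']
print(large)  # ['First sentence. Second one! Third?']
```

With a large limit every piece is merged back into a single chunk, which is why a short sample text only shows visible splits once chunk_size is smaller than the text. In this sketch a single piece longer than chunk_size is kept as its own oversized chunk.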

Setting is_separator_regex=True is crucial when you want to use regular expressions as separators. Without it, the separators are escaped and matched as literal strings.
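
The effect of that flag can be reproduced with the standard library alone. The sketch below assumes the flag simply decides whether the separator is passed through re.escape before splitting; split_on is a hypothetical helper written for this illustration, not a LangChain function.

```python
import re

def split_on(separator, text, is_separator_regex):
    """Split text on separator, treating it as a regex or as a literal
    string depending on the flag (hypothetical helper)."""
    pattern = separator if is_separator_regex else re.escape(separator)
    return re.split(pattern, text)

text = "v1.0 released. v2.0 planned."
separator = r"\.\s"  # as a regex: a period followed by whitespace

as_regex = split_on(separator, text, is_separator_regex=True)
as_literal = split_on(separator, text, is_separator_regex=False)

print(as_regex)    # ['v1.0 released', 'v2.0 planned.']
print(as_literal)  # ['v1.0 released. v2.0 planned.']
```

Treated as a literal, the four characters `\.\s` never occur in the text, so nothing is split. That is the symptom to look for when regex separators seem to be ignored.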
