machine learning - Getting Unicode error while saving to disk in distiset - Stack Overflow

Here is my code:

import argparse
import os
from typing import List
from pydantic import BaseModel, Field
from datasets import Dataset
from dotenv import load_dotenv

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration

# Load environment variables (e.g. HF_TOKEN) from a local .env file
load_dotenv()

################################################################################
# Script Parameters
################################################################################

parser = argparse.ArgumentParser(
    description="Generate exam questions from text files in a directory."
)
parser.add_argument(
    "--model_id",
    type=str,
    default="Qwen/Qwen2.5-7B-Instruct",
    help="Model ID for text generation",
)
parser.add_argument(
    "--tokenizer_id",
    type=str,
    default="Qwen/Qwen2.5-7B-Instruct",
    help="Tokenizer ID for text generation",
)
parser.add_argument(
    "--input_dir",
    type=str,
    help="Directory containing input text files",
    default="data",
)
parser.add_argument(
    "--max_new_tokens",
    type=int,
    default=2048,
    help="Maximum number of new tokens to generate",
)
parser.add_argument(
    "--output_path",
    type=str,
    default="exam_questions_output",
    help="Directory to save the generated datasets",
)

args = parser.parse_args()

################################################################################
# Load the documents
# We assume that the documents are in the input directory, and that each file
# is a separate document about the same topic.
################################################################################

# Process all text files in the input directory
documents = []
for filename in os.listdir(args.input_dir):
    if filename.endswith(".txt"):
        file_path = os.path.join(args.input_dir, filename)
        with open(file=file_path, mode="r", encoding="utf-8", errors="replace") as file:
            document_content = file.read()
            documents.append(document_content)

# Create a single dataset from all document contents
dataset = Dataset.from_dict({"document": documents})

################################################################################
# Define the prompts
# We use a system prompt to guide the model to generate the correct output format.
# A template is used to insert the document into the prompt.
################################################################################

SYSTEM_PROMPT = """\
You are an exam writer specialized in writing exams for students.
Your goal is to create questions and answers based on the document provided, 
and a list of distractors, that are incorrect but viable answers to the question.
Your answer must adhere to the following format:

[
    {
        "question": "Your question",
        "answer": "The correct answer to the question",
        "distractors": ["wrong answer 1", "wrong answer 2", "wrong answer 3"]
    },
    ... (more questions and answers as required)
]

""".strip()

INSTRUCTION_TEMPLATE = """\
    Generate a list of answers and questions about the document. 
    Document:\n\n{{ instruction }}"""

################################################################################
# Define the output structure
# We define a data model for the output of the pipeline, this is used to ensure
# that the output is in the correct format for the evaluation task.
################################################################################


class ExamQuestion(BaseModel):
    question: str = Field(..., description="The question to be answered")
    answer: str = Field(..., description="The correct answer to the question")
    distractors: List[str] = Field(
        ..., description="A list of incorrect but viable answers to the question"
    )


class ExamQuestions(BaseModel):
    exam: List[ExamQuestion]


################################################################################
# Create the pipeline
# We create a pipeline with a single task that generates the exam questions
# based on the document and in the correct format. We will use Hugging Face
# InferenceEndpoints and the model specified in the arguments.
################################################################################

with Pipeline(
    name="Domain-Eval-Questions",
    description="Generate exam questions based on given documents.",
) as pipeline:
    # Set up the text generation task
    text_generation = TextGeneration(
        name="exam_generation",
        llm=InferenceEndpointsLLM(
            model_id=args.model_id,
            tokenizer_id=args.tokenizer_id,
            api_key=os.environ["HF_TOKEN"],
            structured_output={
                "schema": ExamQuestions.model_json_schema(),
                "format": "json",
            },
        ),
        input_batch_size=8,
        output_mappings={"model_name": "generation_model"},
        input_mappings={"instruction": "document"},
        system_prompt=SYSTEM_PROMPT,
        template=INSTRUCTION_TEMPLATE,
    )


################################################################################
# Run the pipeline
# We run the pipeline for all documents and save the results to the output path.
################################################################################

if __name__ == "__main__":
    # Run the pipeline for all documents
    distiset = pipeline.run(
        parameters={
            "exam_generation": {
                "llm": {
                    "generation_kwargs": {
                        "max_new_tokens": args.max_new_tokens,
                    }
                }
            }
        },
        use_cache=False,
        dataset=dataset,
    )
    # distiset.save_to_disk(distiset_path=args.output_path)
    try:
        distiset.save_to_disk(args.output_path)
    except UnicodeDecodeError as e:
        print(f"Unicode error while saving: {e}")

It's giving me the error "'charmap' codec can't decode byte 0x9d in position 34: character maps to <undefined>" on the distiset.save_to_disk call.

I expected it to run as usual, without raising this error.
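For context, a 'charmap' decode error on byte 0x9d is what Python typically raises when UTF-8 bytes are decoded with a Windows code page such as cp1252, where 0x9d is unassigned. The sketch below only reproduces that class of error in isolation. The sample text and the try_decode helper are hypothetical, and it is an assumption (not confirmed from the distilabel source) that save_to_disk hits the same situation by reading a UTF-8 file with the platform default encoding.

```python
# Hypothetical reproduction of the error class; not taken from distilabel.
# The right curly quote U+201D is encoded in UTF-8 as the bytes E2 80 9D.
text = "He said \u201chello\u201d"
raw = text.encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"error: {exc}"


# Decoding with the encoding the bytes were written in works.
print(try_decode(raw, "utf-8"))

# Decoding the same bytes as cp1252 fails on 0x9d, which cp1252 leaves undefined.
print(try_decode(raw, "cp1252"))
```

If the failing read does rely on the platform default encoding, running the script in Python's UTF-8 mode (python -X utf8 script.py, or setting the PYTHONUTF8=1 environment variable) is one thing to try. Whether that resolves this particular save_to_disk failure is an assumption.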
