javascript - pdfjs: get raw text from pdf with correct newline/whitespace - Stack Overflow

Updated: 2025-03-06

Using pdf.js, I wrote a simple function to extract the raw text from a PDF:

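async function getPdfText(path) {
    // Older builds expose the global PDFJS. Newer builds expose pdfjsLib,
    // where getDocument() returns a loading task and you await its .promise.
    const pdf = await PDFJS.getDocument(path);

    const pagePromises = [];
    for (let j = 1; j <= pdf.numPages; j++) {
        // Load each page, then flatten its text items into a single string.
        pagePromises.push(pdf.getPage(j).then(async (page) => {
            const text = await page.getTextContent();
            return text.items.map((s) => s.str).join('');
        }));
    }

    // Pages resolve in order, so the joined text keeps the page order.
    const texts = await Promise.all(pagePromises);
    return texts.join('');
}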

// usage
getPdfText("C:\\my.pdf").then((text) => { console.log(text); });

However, I can't find a way to extract the line breaks correctly: all the text comes out on a single line.

How can I extract the text correctly? I want the same result as on a desktop PC:

Open the PDF (double-click the file) -> select all text (CTRL + A) -> copy the selected text (CTRL + C) -> paste the copied text (CTRL + V)

asked Feb 12, 2019 at 7:52 by ar099968

1 Answer

I know the question is more than a year old, but I'm answering in case anyone else has the same problem.

As this post said:

In PDF there is no such thing as controlling layout using control chars such as '\n'; glyphs in PDF are positioned using exact coordinates. Use the text y-coordinate (it can be extracted from the transform matrix) to detect a line change.

So with pdf.js, you can use the transform property of each entry in textContent.items, specifically index 5 of that array, which holds the y-coordinate. If this value changes from one item to the next, a new line has started.
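To make the idea concrete, here is a minimal sketch of the line-change check on its own. The helper name startsNewLine and the tolerance parameter are assumptions made for illustration, not part of pdf.js, and the sample items are mock objects that carry only the str and transform fields. The tolerance is there because items on the same visual line may differ slightly in y (for example with superscripts), so an exact comparison can be too strict.

```javascript
// Sketch: decide whether a text item starts a new line by comparing
// y-coordinates. `startsNewLine` and `tolerance` are hypothetical names.
function startsNewLine(prevItem, item, tolerance = 0.5) {
    if (!prevItem) {
        return false; // the first item never needs a line break before it
    }
    // transform is [a, b, c, d, e, f]; e is the x offset, f is the y offset.
    const prevY = prevItem.transform[5];
    const y = item.transform[5];
    return Math.abs(y - prevY) > tolerance;
}

// Mock items shaped like textContent.items (only the fields used here).
const sampleItems = [
    { str: 'Hello ', transform: [12, 0, 0, 12, 72, 700] },
    { str: 'world', transform: [12, 0, 0, 12, 110, 700] },
    { str: 'Second line', transform: [12, 0, 0, 12, 72, 686] },
];

const breaks = sampleItems.map((item, i) => startsNewLine(sampleItems[i - 1], item));
console.log(breaks); // [ false, false, true ]
```

A suitable tolerance depends on the document's font sizes, so treat 0.5 as a placeholder rather than a recommended value.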

Here's my code:

    page.getTextContent().then(function (textContent) {
        var textItems = textContent.items;
        var finalString = "";
        var line = null; // y-coordinate of the current line; null before the first item

        for (var i = 0; i < textItems.length; i++) {
            var item = textItems[i];
            var y = item.transform[5]; // vertical position of this text item

            // A different y-coordinate means this item starts a new line.
            if (line !== null && y !== line) {
                finalString += '\r\n';
            }
            line = y;

            // Concatenate the string of the item to the final string
            finalString += item.str;
        }

        // Show the result in the element with id "output" (e.g. a <textarea>).
        var node = document.getElementById('output');
        node.value = finalString;
    });
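The same check can be applied to every page, as in the question's loop. The following is only a sketch under stated assumptions: pageItemsToText and getPdfTextWithNewlines are hypothetical names, the function expects an already-loaded document object (anything with numPages and getPage), and pages are joined with a single '\n'.

```javascript
// Joins one page's text items, breaking the line when the y-coordinate changes.
// `pageItemsToText` is a hypothetical helper name.
function pageItemsToText(items) {
    let result = '';
    let lineY = null; // y-coordinate of the line being built
    for (const item of items) {
        const y = item.transform[5];
        if (lineY !== null && y !== lineY) {
            result += '\n';
        }
        lineY = y;
        result += item.str;
    }
    return result;
}

// `pdf` is assumed to be an already-loaded document exposing
// numPages and getPage(), as in the question's code.
async function getPdfTextWithNewlines(pdf) {
    const pages = [];
    for (let j = 1; j <= pdf.numPages; j++) {
        const page = await pdf.getPage(j);
        const textContent = await page.getTextContent();
        pages.push(pageItemsToText(textContent.items));
    }
    // Separate pages with a newline so page boundaries also break the line.
    return pages.join('\n');
}
```

With a recent pdf.js build you would typically load the document first, for example with await pdfjsLib.getDocument(path).promise, and pass the result to getPdfTextWithNewlines. Pages are read one after another here to keep the sketch simple; the question's Promise.all approach works equally well.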

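As weird as it sounds, instead of using transform, you can also use the fontName property: in my case the fontName changed with each new line. Whether that holds depends on how the PDF was produced, so the y-coordinate is the more reliable signal.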
