Using pdf.js, I wrote a simple function to extract the raw text from a PDF:

    async function getPdfText(path) {
        // PDFJS is the pdf.js global in older builds; newer builds expose
        // pdfjsLib and need `await pdfjsLib.getDocument(path).promise`.
        const pdf = await PDFJS.getDocument(path);
        const pagePromises = [];
        for (let j = 1; j <= pdf.numPages; j++) {
            const page = pdf.getPage(j);
            pagePromises.push(page.then((page) => {
                const textContent = page.getTextContent();
                return textContent.then((text) => {
                    // Joining with '' is what drops every line break.
                    return text.items.map((s) => s.str).join('');
                });
            }));
        }
        const texts = await Promise.all(pagePromises);
        return texts.join('');
    }

    // usage
    getPdfText("C:\\my.pdf").then((text) => { console.log(text); });

However, I can't find a way to extract the line breaks correctly: all the text comes out on a single line.
How can I extract the text correctly? I want the same result as on a desktop PC:
Open the PDF (double-click the file) -> select all text (CTRL + A) -> copy the selected text (CTRL + C) -> paste the copied text (CTRL + V)
asked Feb 12, 2019 at 7:52 by ar099968

1 Answer
I know the question is more than a year old, but I'm answering in case anyone has the same problem.
As this post says:
In PDF there no such thing as controlling layout using control chars such as '\n' -- glyphs in PDF positioned using exact coordinates. Use text y-coordinate (can be extracted from transform matrix) to detect a line change.
So with pdf.js, you can use the transform property of each entry in textContent.items, specifically element 5 of that array, which holds the y-coordinate. If this value changes from one item to the next, a new line has started.
Here's my code:

    page.getTextContent().then(function (textContent) {
        var textItems = textContent.items;
        var finalString = "";
        var line = 0; // y-coordinate of the line currently being built

        // Concatenate the string of each item to the final string
        for (var i = 0; i < textItems.length; i++) {
            var item = textItems[i];
            // transform[5] is the item's y-coordinate; a change means a new line
            if (line != item.transform[5]) {
                if (line != 0) {
                    finalString += '\r\n';
                }
                line = item.transform[5];
            }
            finalString += item.str;
        }

        var node = document.getElementById('output');
        node.value = finalString;
    });

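The same idea can be pulled out into a small function that does not touch the DOM, which makes it easier to try with hand-made data. This is only a sketch: itemsToText and its tolerance parameter are hypothetical names, not part of pdf.js. The tolerance is an assumption that items on one visual line may differ slightly in y; leave it at 0 to compare exactly.

```javascript
// Hypothetical helper, not part of pdf.js: joins text items into lines.
// `items` is expected to look like textContent.items, i.e. objects with
// a `str` string and a `transform` array whose index 5 is the y-coordinate.
function itemsToText(items, tolerance = 0) {
  const lines = [];
  let currentY = null; // null (rather than 0) so a line at y = 0 still works
  for (const item of items) {
    const y = item.transform[5];
    // Start a new line when y moves by more than the tolerance.
    if (currentY === null || Math.abs(y - currentY) > tolerance) {
      lines.push('');
      currentY = y;
    }
    lines[lines.length - 1] += item.str;
  }
  return lines.join('\n');
}

// Hand-made items standing in for textContent.items.
const sample = [
  { str: 'Hello ', transform: [1, 0, 0, 1, 10, 700] },
  { str: 'world', transform: [1, 0, 0, 1, 60, 700] },
  { str: 'Second line', transform: [1, 0, 0, 1, 10, 680] },
];
console.log(itemsToText(sample));
```

In the function from the question, text.items.map((s) => s.str).join('') could then become itemsToText(text.items), with the pages joined by '\n' instead of ''.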
As weird as it sounds, instead of using transform, you can also use the fontName property: with each new line, the fontName changes.
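Whether fontName changes on every line may depend on the PDF, so it is probably worth comparing it against the y-coordinate before relying on it. A rough sketch of such a check follows. splitOnChange is a hypothetical helper, not a pdf.js function, and the items and fontName values are invented for illustration.

```javascript
// Hypothetical helper: start a new line whenever keyOf(item) changes.
function splitOnChange(items, keyOf) {
  const lines = [];
  let lastKey;
  items.forEach((item, i) => {
    const key = keyOf(item);
    if (i === 0 || key !== lastKey) {
      lines.push('');
      lastKey = key;
    }
    lines[lines.length - 1] += item.str;
  });
  return lines;
}

// Hand-made items; real ones would come from textContent.items.
const items = [
  { str: 'First ', fontName: 'g_d0_f1', transform: [1, 0, 0, 1, 10, 700] },
  { str: 'line', fontName: 'g_d0_f1', transform: [1, 0, 0, 1, 50, 700] },
  { str: 'Second line', fontName: 'g_d0_f1', transform: [1, 0, 0, 1, 10, 680] },
];

const byY = splitOnChange(items, (it) => it.transform[5]);
const byFont = splitOnChange(items, (it) => it.fontName);
// If the two disagree, fontName is not a safe line marker for this input.
console.log(byY.length === byFont.length ? 'same line count' : 'different line count');
```

If both splits give the same lines on your own documents, either property should work there; if not, the y-coordinate is the safer choice.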