I am using Java with Tess4j-5.13.0.jar to read a PDF that contains an image of a table. This is my first time using Tess4j/Tesseract.
Tess4j is located here :
The pdf I am trying to convert :
The PDF contains a single image that looks like a table with a heading row. When the PDF is processed, only the heading line is returned and the rest of the table is ignored. One extra string, "-ma_———", is also returned, and I do not know where it comes from.
This is the code I used:
import java.io.File;
import java.io.IOException;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

// The class declaration was missing from the post; the name here is a placeholder.
public class PdfOcrTest {
    public static void main(String[] args) throws IOException, TesseractException {
        // The PDF that contains the table image
        File imageFile = new File("C:/Users/DFDS_Y1_2025.pdf");
        ITesseract instance = new Tesseract(); // JNA Interface Mapping
        instance.setDatapath("C:/Users/Tess4J/tessdata");
        instance.setLanguage("eng");
        //List<RenderedFormat> renderFormats = new ArrayList<RenderedFormat>();
        //renderFormats.add(RenderedFormat.PDF);
        //instance.createDocumentsWithResults(imageFile,null,"C:/Users/DFDS_Y1_2025_out2", renderFormats, TessPageIteratorLevel.RIL_BLOCK);
        try {
            // Run OCR on the PDF and print the recognized text
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.out.println("ERROR");
            System.err.println(e.getMessage());
        }
    }
}
The result that gets printed to the console is:
Destination Rate O-1OT Rate 10.01-17T Full rate
-ma_———
So it's the heading plus, for some reason, the string -ma_———
I was expecting all the other rows of data to be returned as well.
I have also tried extracting the image from the PDF, converting it to grayscale, and using the image file as input instead of the PDF, but I got the same result. I went through the online examples and their code is similar to mine. I can't see what I have to do to get the rest of the data.
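For reference, the grayscale step described above can be sketched roughly as follows. This is a minimal illustration using only the standard `java.awt` classes, not the exact code I ran; the class name `GrayscaleSketch` and the method name `toGray` are made up for this example.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class GrayscaleSketch {

    /**
     * Returns a grayscale copy of the given image.
     * The source image is left unchanged.
     */
    public static BufferedImage toGray(BufferedImage src) {
        // TYPE_BYTE_GRAY stores one 8-bit gray sample per pixel
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // Drawing into the gray image converts the colors
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }
}
```

The image extracted from the PDF can be loaded with `ImageIO.read(...)`, passed through `toGray`, and the result handed to `instance.doOCR(BufferedImage)` instead of the PDF file.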
I am using Eclipse, and this is the console output when I run the code:
I know this can be done with Tesseract, because I tested the same PDF with Scribe, a UI based on Tesseract.
When the PDF is uploaded to Scribe, it extracts all the text data in the image.
I am not sure what I am doing wrong. The PDF is clear and should work. Should the image or PDF be preprocessed, or is something else wrong?
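To make the preprocessing question concrete, one step I could try is enlarging the image before OCR, on the assumption (not verified for this PDF) that the table rows are rendered too small for Tesseract to pick up. The sketch below uses only standard `java.awt` classes; `UpscaleSketch`, `upscale`, and the `factor` parameter are names made up for this example.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class UpscaleSketch {

    /**
     * Returns a copy of the image enlarged by an integer factor,
     * using bicubic interpolation to keep text edges smooth.
     */
    public static BufferedImage upscale(BufferedImage src, int factor) {
        if (factor < 1) {
            throw new IllegalArgumentException("factor must be >= 1");
        }
        int w = src.getWidth() * factor;
        int h = src.getHeight() * factor;
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            // Scale the source to fill the larger target image
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return out;
    }
}
```

For example, `upscale(image, 2)` doubles the width and height, and the result can be passed to `instance.doOCR(BufferedImage)`. Whether this actually helps with the missing rows is exactly what I am unsure about.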
Please let me know if you need more info.
Any help would be appreciated.