I am using Java with Tess4j-5.13.0.jar to read a PDF that contains a table-like image. This is my first time using Tess4J/Tesseract.

Tess4J is located here:

The PDF I am trying to convert:

The problem is that when the PDF is processed, only the first heading line is returned and the rest is ignored.

The PDF contains a single image that looks like a table with a heading. The heading is returned, but the rows of the table are not. One extra string is also returned, and I do not know where it comes from: "-ma_———"

This is the code I used:

import java.io.File;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

// Class name is a placeholder; the original class declaration was not part of the snippet.
public class OcrTest {

    public static void main(String[] args) {
        // PDF that contains the table image
        File imageFile = new File("C:/Users/DFDS_Y1_2025.pdf");
        ITesseract instance = new Tesseract(); // JNA Interface Mapping
        instance.setDatapath("C:/Users/Tess4J/tessdata");
        instance.setLanguage("eng");

        //List<RenderedFormat> renderFormats = new ArrayList<RenderedFormat>();
        //renderFormats.add(RenderedFormat.PDF);
        //instance.createDocumentsWithResults(imageFile,null,"C:/Users/DFDS_Y1_2025_out2", renderFormats, TessPageIteratorLevel.RIL_BLOCK);

        try {
            // Run OCR on the whole file and print the recognized text
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.out.println("ERROR");
            System.err.println(e.getMessage());
        }
    }
}

The result that gets printed to the console is:

Destination Rate O-1OT Rate 10.01-17T Full rate

-ma_———

So it is the heading plus, for some reason, the string -ma_———

I was expecting all the other rows of data to be returned.

I also tried extracting the image from the PDF first, converting it to grayscale, and using that image file as input instead of the PDF, but I got the same result. I went through the online examples and their code is similar to mine, so I can't see what I need to change to get the rest of the data.
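
The grayscale step I mean can be sketched roughly like this. It is a minimal illustration that uses only the standard java.awt classes, not the exact code I ran; the class name GrayscalePreprocess and the method name toGrayscale are placeholders chosen for this example.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper: names are placeholders for illustration only.
final class GrayscalePreprocess {

    // Returns a new 8-bit grayscale copy of the source image with the same dimensions.
    static BufferedImage toGrayscale(BufferedImage source) {
        BufferedImage gray = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // Drawing onto a TYPE_BYTE_GRAY image performs the color conversion.
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }
}
```

The converted image can be saved with ImageIO.write and passed to doOCR as a File, or handed directly to the doOCR(BufferedImage) overload that Tess4J provides.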

I am using Eclipse, and this is the console output when I run the code:

I know this can be done with Tesseract, because I tested it with the Scribe UI, which is based on Tesseract.

When the PDF is uploaded to Scribe, it extracts all the text in the image.

I am not sure what I am doing wrong. The PDF is clear and should work. Does the image or PDF need to be preprocessed, or is there something else I am missing?

Please let me know if you need more info.

Any help would be appreciated.
