Scan or convert to searchable pdf


  • This topic has 8 replies, 4 voices, and was last updated Apr 15-11:51 am by clemency.
  • #2089
    Member
    andfree

      Hello everybody. All the best for the new forum. My first question in here:
      Is there any free way to scan documents as searchable pdf files? Alternatively, is there any free way to convert non-searchable files into searchable ones?
      Thank you.

      #2094
      Moderator
      caprea

        Not sure if this is what you mean, but there’s “gscan2pdf” in the repo. You will probably have to install the tesseract-ocr package for your language.

        • This reply was modified 6 years, 8 months ago by caprea.
        #2114
        Member
        andfree

          Thanks for the reply. gscan2pdf is already installed, but the quality of the scans is very bad. I can’t find how to improve it.

          #2115
          Forum Admin
          SamK

            …the quality of the scans is very bad. I can’t find how to improve it.

            I can’t vouch for this link as I have not used the technique, but it does look like it might help you along the way.
            http://quaintproject.wordpress.com/2015/09/30/searchable-pdf-from-scan-under-linux/

            #2116
            Member
            andfree

              Thank you, but I don’t see the link.

              #2179
              Forum Admin
              SamK

                Thank you, but I don’t see the link.

                Appended absent link to previous post and also added here http://quaintproject.wordpress.com/2015/09/30/searchable-pdf-from-scan-under-linux/

                #2190
                Member
                andfree

                  I can’t vouch for this link as I have not used the technique, but it does look like it might help you along the way.
                  image to searchable pdf under linux

                  Thanks for the link. It worked for me, but not perfectly. The produced PDF file seems to be good, but there are character recognition mistakes, so neither searching nor copy-paste works perfectly. I have only tested it with Greek text, not English. For the Greek language, I installed tesseract-ocr-ell 3.02-2 from SPM.

                  #2193
                  Forum Admin
                  SamK

                    It worked for me, but not perfectly.

                    Over the years I have tried various OCR programs, mainly commercial pay-to-use licensed types. None of them achieved 100% accurate results. I think it is not possible in all circumstances, simply because of the wide range of variables that can influence the process and affect the result.

                    It is probable that everyone will have their own threshold of what represents an acceptable outcome.

                    An idea you might like to try…

                    If the text content of a PDF document you want to make searchable is more important than preserving its layout exactly, you might try a multi-stage approach:

                    1. Extract the content of the unsearchable PDF as plain text
                    2. Edit the plain text document to remove unwanted parts and make corrections
                    3. Save the edited text file in searchable PDF format

                    Step 1
                    Extract a range of sequential pages within a PDF file and output a single merged text file.
                    Use a command something like the following:
                    pdftotext -layout -f 13 -l 14 input.pdf extracted-pages.txt
                    -layout attempts to maintain the original physical layout of the text in the PDF document
                    -f Specifies the first page to extract
                    -l Specifies the last page to extract
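A minimal sketch of Step 1 wrapped in a reusable shell function, assuming pdftotext (from poppler-utils) is installed. The function name extract_pages and the DRY_RUN switch are hypothetical conveniences, not part of pdftotext; with DRY_RUN=1 the command is only printed, so it can be checked before it is run.

```shell
#!/bin/sh
# extract_pages FIRST LAST INPUT.pdf OUTPUT.txt
# Hypothetical helper around the pdftotext command shown above.
extract_pages() {
    first=$1
    last=$2
    input=$3
    output=$4

    # Refuse an inverted page range before calling pdftotext.
    if [ "$first" -gt "$last" ]; then
        echo "first page ($first) is after last page ($last)" >&2
        return 1
    fi

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run (an assumption of this sketch): print the command only.
        echo "pdftotext -layout -f $first -l $last $input $output"
    else
        pdftotext -layout -f "$first" -l "$last" "$input" "$output"
    fi
}
```

For example, extract_pages 13 14 input.pdf extracted-pages.txt runs the same command as above.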

                    Step 2
                    Edit extracted-pages.txt using your preferred text editor or LibreOffice Writer (LOW)

                    Step 3
                    Save the edited version of extracted-pages.txt as a searchable PDF file.
                    LibreOffice Writer seems to do this OK

                    Step 3 can be done via the command used to start LOW.
                    Have a look at the foot of this page for how to do it.
                    http://help.libreoffice.org/Common/Starting_the_Software_With_Parameters
                    You might be able to build up a collection of files via steps 1 and 2, then feed them to LOW to convert them automatically.
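A rough sketch of that batch idea, assuming LibreOffice is on the PATH as soffice and that its --headless, --convert-to and --outdir parameters behave as described on the help page linked above. The function name convert_texts, the directory arguments and the DRY_RUN switch are hypothetical. Non-ASCII text (Greek, for example) may need an explicit import filter, which this sketch does not attempt.

```shell
#!/bin/sh
# convert_texts SRC_DIR OUT_DIR
# Hypothetical helper: convert every edited .txt file in SRC_DIR to a PDF in OUT_DIR.
convert_texts() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"

    for txt in "$src_dir"/*.txt; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$txt" ] || continue

        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Dry run (an assumption of this sketch): print the command only.
            echo "soffice --headless --convert-to pdf --outdir $out_dir $txt"
        else
            soffice --headless --convert-to pdf --outdir "$out_dir" "$txt"
        fi
    done
}
```

Running it first with DRY_RUN=1 shows which files would be converted before LibreOffice is started.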

                    • This reply was modified 6 years, 8 months ago by SamK.
                    #9160
                    Member
                    clemency

                      There’s a project on GitHub called Paperwork. It is the best scanning solution I have found for Linux, except VueScan, which is a paid application. It has OCR support, but I have not tried it. You can give it a try. This is the link: http://github.com/openpaperwork/paperwork.
