Convert PDF file to searchable one


  • This topic has 13 replies, 7 voices, and was last updated May 11-7:01 am by andfree.
  • #106094
    Member
    andfree

      Hi. Is there any application for converting PDF files to searchable ones? Here is a reply to an older question of mine. I tried the “pdftotext” command, but:
      bash: pdftotext: command not found

      #106095
      Member
      RJP

        Install poppler-utils

        sudo apt install poppler-utils

        #106096
        Member
        lgj100

          ocrmypdf works great.

          #106119
          Member
          Bernd

            ocrmypdf is wonderful. Small limitation, it only works in the terminal.

            
            find . -name '*.pdf' -printf '%p\n' -exec ocrmypdf -l deu --rotate-pages '{}' '{}' \;

            This searches the current folder and its subfolders, runs OCR on every PDF file it finds, and adds a text layer to each file in place.
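A slightly more defensive variant of the batch command above can be sketched as follows. It assumes ocrmypdf is installed, writes each result to a new file instead of overwriting the original, and takes the Tesseract language code from a variable. The `_ocr.pdf` suffix and the names `OCR_LANG`, `ocr_output_name` and `ocr_tree` are made up for this illustration.

```shell
#!/bin/sh
# Sketch: OCR every PDF under a directory, keeping the originals untouched.
# OCR_LANG is a Tesseract language code (deu = German, ell = Greek, eng = English).
OCR_LANG="${OCR_LANG:-deu}"

# Derive the output name: report.pdf -> report_ocr.pdf (hypothetical naming scheme).
ocr_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# ocr_tree [DIR]: process all PDFs below DIR (default: current directory).
# File names containing newlines are not handled by this simple loop.
ocr_tree() {
    find "${1:-.}" -type f -name '*.pdf' ! -name '*_ocr.pdf' |
    while IFS= read -r pdf; do
        out=$(ocr_output_name "$pdf")
        printf 'Processing %s -> %s\n' "$pdf" "$out"
        # </dev/null keeps the command from consuming the file list on stdin.
        ocrmypdf -l "$OCR_LANG" --rotate-pages "$pdf" "$out" </dev/null
    done
}
```

After sourcing the file, something like `OCR_LANG=ell ocr_tree ~/scans` would process a folder of Greek scans, provided the matching Tesseract language data is installed.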

            #106253
            Member
            andfree

              Thanks for all the replies. I installed poppler-utils and ran “pdftotext”, but the extracted content appears as squares containing “000C”, as you can see in the attachment (the language is Greek).
              ocrmypdf seems to work fine, but when I copy text from the resulting searchable PDF and paste it into a text editor or LibreOffice Writer, they don’t seem to recognize that the text is Greek, and it’s displayed in Latin/English characters.
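One possible cause of the Latin-looking text, not confirmed anywhere in this thread, is that the OCR ran with a non-Greek language model: ocrmypdf uses English unless `-l` is given, and the batch example earlier in the thread uses `-l deu` (German). A minimal sketch of re-running the OCR with Tesseract's Greek model (code `ell`) follows. The Debian package name `tesseract-ocr-ell` and the helper names `lang_listed` and `ocr_greek` are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: OCR a scanned PDF with Tesseract's Greek model ("ell").
# Assumed prerequisite on Debian-based systems:
#   sudo apt install ocrmypdf tesseract-ocr-ell

# Succeed if language code $1 appears as a whole line in the list read from stdin.
lang_listed() {
    grep -qx -- "$1"
}

# ocr_greek INPUT.pdf OUTPUT.pdf  (helper name is hypothetical)
ocr_greek() {
    if ! tesseract --list-langs 2>&1 | lang_listed ell; then
        echo "Greek language data (ell) not found" >&2
        return 1
    fi
    # Start from the original scan, not from a PDF that already carries a wrong text layer.
    ocrmypdf -l ell --rotate-pages "$1" "$2" &&
        # Print the first lines of the new text layer to check that it is Greek.
        pdftotext "$2" - | head -n 5
}
```

For example, `ocr_greek scan.pdf scan_greek.pdf`. For documents that mix Greek and English, `-l ell+eng` may give better results.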

              • This reply was modified 3 days, 11 hours ago by andfree.
              #106256
              Member
              Robin

                in the LibreOffice Writer, they don’t seem to recognize that the text is greek

                Have you made sure you set the proper language in your target document in LibreOffice before pasting? See: https://help.libreoffice.org/7.4/en-GB/text/shared/guide/language_select.html?&DbPAR=WRITER&System=UNIX

                and paste it in a text editor

                In Geany, you’ll need to set the character encoding from the Document menu to match the encoding used in the PDF file before pasting, so that the characters display properly.

                Windows is like a submarine. Open a window and serious problems will start.

                #106282
                Member
                Xunzi_23

                  It seems you may need to be pragmatic. MasterPdfEditor free version 4.3 is still available
                  and can do pretty much everything PDF-related.

                  #106304
                  Member
                  andfree

                    Thanks for the replies.

                    In geany you’ll need to set the character encoding from the document menu according to the encoding used in the pdf file before pasting, to display the characters properly.

                    I don’t know the encoding used in the PDF file. I tried ISO-8859-7, WINDOWS-1253 and UTF-8, but this didn’t help.

                    MasterPdfEditor free version 4.3 is still available

                    Unfortunately, I can’t find a file for 32-bit.

                    #106311
                    Member
                    PPC

                      Unfortunately, I can’t find a file for 32-bit.

                      Try here (I’m not sure if it’s the free version):

                      https://code-industry.net/public/master-pdf-editor-4.3.89_i386.deb

                      Edit: I installed the 64-bit version (it’s a handy app to have around). It’s the last free version. Also, in Package Installer, there “should” be an entry that allows you to install this application.

                      • This reply was modified 2 days, 7 hours ago by PPC.
                      • This reply was modified 1 day, 9 hours ago by PPC.
                      #106394
                      Member
                      andfree

                        Try (…)
                        https://code-industry.net/public/master-pdf-editor-4.3.89_i386.deb

                        Thank you, but there are dependency problems:

                        $ sudo dpkg -i master-pdf-editor-4.3.89_i386.deb
                        Selecting previously unselected package master-pdf-editor.
                        (Reading database ... 106753 files and directories currently installed.)
                        Preparing to unpack master-pdf-editor-4.3.89_i386.deb ...
                        Unpacking master-pdf-editor (4.3.89) ...
                        dpkg: dependency problems prevent configuration of master-pdf-editor:
                         master-pdf-editor depends on libqt4-svg (>= 4.6.4); however:
                          Package libqt4-svg is not installed.
                         master-pdf-editor depends on libqt4-network (>= 4.6.4); however:
                          Package libqt4-network is not installed.
                         master-pdf-editor depends on libqtcore4 (>= 4.6.4); however:
                          Package libqtcore4 is not installed.
                         master-pdf-editor depends on libqtgui4 (>= 4.8.4); however:
                          Package libqtgui4 is not installed.
                        
                        dpkg: error processing package master-pdf-editor (--install):
                         dependency problems - leaving unconfigured
                        Processing triggers for hicolor-icon-theme (0.17-2) ...
                        Processing triggers for desktop-file-utils (0.26-1) ...
                        Processing triggers for mailcap (3.69) ...
                        Errors were encountered while processing:
                         master-pdf-editor
                        $ sudo apt install libqt4-svg
                        Reading package lists... Done
                        Building dependency tree... Done
                        Reading state information... Done
                        Package libqt4-svg is not available, but is referred to by another package.
                        This may mean that the package is missing, has been obsoleted, or
                        is only available from another source
                        
                        E: Package 'libqt4-svg' has no installation candidate
                        $ sudo apt install libqt4-network
                        Reading package lists... Done
                        Building dependency tree... Done
                        Reading state information... Done
                        Package libqt4-network is not available, but is referred to by another package.
                        This may mean that the package is missing, has been obsoleted, or
                        is only available from another source
                        
                        E: Package 'libqt4-network' has no installation candidate
                        $ sudo apt install libqtcore4
                        Reading package lists... Done
                        Building dependency tree... Done
                        Reading state information... Done
                        Package libqtcore4 is not available, but is referred to by another package.
                        This may mean that the package is missing, has been obsoleted, or
                        is only available from another source
                        However the following packages replace it:
                          qtchooser libqt5core5a
                        
                        E: Package 'libqtcore4' has no installation candidate
                        $ sudo apt install libqtgui4
                        Reading package lists... Done
                        Building dependency tree... Done
                        Reading state information... Done
                        Package libqtgui4 is not available, but is referred to by another package.
                        This may mean that the package is missing, has been obsoleted, or
                        is only available from another source
                        
                        E: Package 'libqtgui4' has no installation candidate
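The failure above could be spotted before unpacking the package. A hedged sketch, assuming a Debian-based system with `dpkg-deb` and `apt-cache`: it lists the dependencies declared in a .deb file and reports which of them have no installation candidate in the configured repositories. The helper names `dep_names` and `check_deb_deps` are invented for this example. Since the dpkg output shows the package was left unconfigured, removing it again with `sudo dpkg --remove master-pdf-editor` is probably sensible.

```shell
#!/bin/sh
# Split a Debian "Depends:" value read from stdin into bare package names, one per line.
# Version constraints such as "(>= 4.6.4)" are dropped; for alternatives ("a | b")
# only the first is kept. Architecture qualifiers (":any") are not handled.
dep_names() {
    tr ',' '\n' |
    sed -e 's/([^)]*)//g' -e 's/|.*//' \
        -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' |
    grep -v '^$'
}

# check_deb_deps FILE.deb: report whether each dependency can be installed.
check_deb_deps() {
    dpkg-deb -f "$1" Depends | dep_names |
    while IFS= read -r pkg; do
        # "Candidate: (none)" or no output at all means apt cannot install the package.
        if LC_ALL=C apt-cache policy "$pkg" | grep -q 'Candidate: [^(]'; then
            echo "ok       $pkg"
        else
            echo "MISSING  $pkg"
        fi
    done
}
```

Running `check_deb_deps master-pdf-editor-4.3.89_i386.deb` on the system from this post would be expected to flag the four Qt4 packages, since apt reports no installation candidate for them.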
                        #106395
                        Member
                        Xunzi_23

                          You will need an older version of antiX to provide the required Qt version.
                          Sorry, I wanted to provide the link below yesterday.

                          https://www.linuxuprising.com/2019/04/download-master-pdf-editor-4-for-linux.html

                          I hope PPC can give you more advice. Logging in from crazily hot Bangkok is very difficult.

                          #106401
                          Member
                          PPC

                            Sorry, but as Xunzi said, that 32-bit .deb file is so old that its dependencies are no longer available on modern Linux versions (which currently support Qt5, not Qt4).
                            On a 32-bit system, the best course of action to get “searchable text” from a PDF file would be:
                            – if there is text in the PDF (and not merely an image that happens to show text, just as it might show a drawing), you can probably search through it using searchmonkey
                            – if there is no text in the PDF, you first need to perform OCR on it, using the previously mentioned tools, to convert the image to real text, which can then be saved in a text document. There are also tools to extract images from the .pdf file; you can then perform OCR on those images.
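The two cases above can be told apart from the terminal. A rough sketch, assuming poppler-utils and Tesseract are installed: pdftotext writes a form-feed character (U+000C) at each page break, so output that consists only of whitespace and form feeds suggests the PDF has no text layer. The helper names, the `page` file prefix, and the default language `ell` (Greek) are illustrative assumptions.

```shell
#!/bin/sh
# Succeed if stdin contains anything besides whitespace (form feeds count as whitespace).
has_visible_text() {
    tr -d '[:space:]' | grep -q .
}

# Succeed if pdftotext can extract real text from the PDF given as $1.
pdf_has_text() {
    pdftotext "$1" - 2>/dev/null | has_visible_text
}

# Extract the embedded images as PNG files and OCR each one into a .txt file.
# $1 = PDF file, $2 = Tesseract language code (defaults to ell, Greek).
ocr_pdf_images() {
    pdfimages -png "$1" page || return 1
    for img in page-*.png; do
        [ -e "$img" ] || continue
        tesseract "$img" "${img%.png}" -l "${2:-ell}"
    done
}
```

A possible use: `pdf_has_text file.pdf && echo "already searchable" || ocr_pdf_images file.pdf ell`.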

                            32-bit PCs are literally a dying breed. They are just too old, there’s only a limited number of things they can do, and their compatibility with modern software is slim. Thankfully, antiX has a 32-bit version, as do Firefox, firefox-esr, seamonkey, and also libreoffice and openoffice (openoffice is much faster and uses far fewer resources on a 32-bit system; I run it from the appimage provided by a very ingenious forum user and enable its quickstarter, which makes office documents load almost instantly on my 20-year-old laptop). My point: one has to adapt to the tools the arch has available… Fortunately, we can still run modern web browsers, office suites, e-mail clients, media players, PDF and ebook readers, music players, etc… Most of that runs, and runs well and even fast, most of the time. What does not run… simply does not run 🙁

                            P.

                            • This reply was modified 1 day, 5 hours ago by PPC.
                            #106418
                            Member
                            Xunzi_23

                              As I am pragmatic, I would use an old Linux version to work on the pdf document.

                              As long as the internet is not involved, there is nothing to worry about, unless the document is
                              so weird you are unable to open it with masterpdfeditor. Version 4.3 is free to
                              use on Linux. Later versions require a purchase to unlock features and to print without
                              a watermark.

                              #106482
                              Member
                              andfree

                                Thanks again for all the replies.

                                Have you made sure you have set the proper language in your target document in libre-office before pasting?

                                I tried Tools -> Language -> For Selection -> Greek, then I pasted again, but it didn’t work.

                                openoffice is much faster and uses much less resources, on a 32bits system- I run it from the appimage provided by a very ingenious forum user, and enable it’s quickstarter – which makes office documents load almost instantly, on my 20 years old laptop

                                Is this appimage available for download?

                                I would use an old Linux version to work on the pdf document.

                                Any idea which version could be compatible?
