Convert PDF file to searchable one
- This topic has 13 replies, 7 voices, and was last updated May 11-7:01 am by andfree.
May 6, 2023 at 7:10 am #106094Member
andfree
Hi. Is there any application for converting PDF files to searchable ones? Here is a reply to an older question of mine. I tried with “pdftotext” command, but:
bash: pdftotext: command not found
May 6, 2023 at 7:15 am #106095MemberRJP
May 6, 2023 at 7:42 am #106096Memberlgj100
May 6, 2023 at 3:05 pm #106119MemberBernd
::ocrmypdf is wonderful. Small limitation, it only works in the terminal.
find . -name '*.pdf' -printf '%p\n' -exec ocrmypdf -l deu --rotate-pages '{}' '{}' \;
This searches a folder, processes every PDF file in it, and adds a text layer to each one.
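A hedged variant of the command above, for anyone who would rather not overwrite the original files. Note that `-l deu` selects German; the Tesseract code for Greek is `ell`, and the matching language data has to be installed (on Debian-based systems this is usually the `tesseract-ocr-ell` package, which is an assumption about your setup). The helper names `ocr_output_name` and `ocr_tree` are made up for this sketch.

```shell
#!/bin/sh
# Sketch only: ocr_output_name and ocr_tree are illustrative helper names.

# Map an input path to its output path: ./docs/a.pdf -> ./docs/a_ocr.pdf
ocr_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# OCR every PDF below a directory, leaving the originals untouched.
#   $1 = directory (default: current directory)
#   $2 = Tesseract language code (default: ell, i.e. Greek)
ocr_tree() {
    dir=${1:-.}
    lang=${2:-ell}
    find "$dir" -type f -name '*.pdf' ! -name '*_ocr.pdf' |
    while IFS= read -r pdf; do
        echo "processing: $pdf"
        # --skip-text leaves pages that already contain text as they are
        ocrmypdf -l "$lang" --rotate-pages --skip-text \
            "$pdf" "$(ocr_output_name "$pdf")" </dev/null
    done
}

# Example call (not run automatically):
#   ocr_tree . ell
```

Writing to a separate `*_ocr.pdf` file means a failed or poor OCR run never damages the source document, at the cost of some disk space.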
May 8, 2023 at 7:43 am #106253Memberandfree
::Thanks for all the replies. I installed poppler-utils and ran “pdftotext”, but the extracted content appears as squares containing “000C”, as you can see in the attachment (the language is Greek).
ocrmypdf seems to work fine, but when I copy text from the resulting searchable PDF and paste it into a text editor or LibreOffice Writer, the text is not recognized as Greek and is displayed in Latin/English characters.
May 8, 2023 at 8:29 am #106256MemberRobin
::in the LibreOffice Writer, they don’t seem to recognize that the text is greek
Have you made sure you have set the proper language in your target document in libre-office before pasting? See: https://help.libreoffice.org/7.4/en-GB/text/shared/guide/language_select.html?&DbPAR=WRITER&System=UNIX
and paste it in a text editor
In geany you’ll need to set the character encoding from the document menu according to the encoding used in the pdf file before pasting, to display the characters properly.
Windows is like a submarine. Open a window and serious problems will start.
May 8, 2023 at 1:56 pm #106282MemberXunzi_23
::Seems you may need to be pragmatic. MasterPdfEditor free version 4.3 is still available
and can do pretty much everything pdf related.
May 9, 2023 at 4:27 am #106304Memberandfree
::Thanks for the replies.
In geany you’ll need to set the character encoding from the document menu according to the encoding used in the pdf file before pasting, to display the characters properly.
I don’t know the encoding used in the pdf file. I tried with ISO-8859-7, WINDOWS-1253 & UTF-8, but this didn’t help.
MasterPdfEditor free version 4.3 is still available
Unfortunately, I can’t find a file for 32-bit.
May 9, 2023 at 10:38 am #106311MemberPPC
::Unfortunately, I can’t find a file for 32-bit.
Try here (I’m not sure if it’s the free version):
https://code-industry.net/public/master-pdf-editor-4.3.89_i386.deb
Edit: I installed the 64-bit version (it’s a handy app to have around). It’s the last free version. Also, in Package Installer, there “should” be an entry that allows you to install this application.
May 10, 2023 at 7:09 am #106394Memberandfree
::Try (…)
https://code-industry.net/public/master-pdf-editor-4.3.89_i386.deb
Thank you, but there are dependency problems:
$ sudo dpkg -i master-pdf-editor-4.3.89_i386.deb
Selecting previously unselected package master-pdf-editor.
(Reading database ... 106753 files and directories currently installed.)
Preparing to unpack master-pdf-editor-4.3.89_i386.deb ...
Unpacking master-pdf-editor (4.3.89) ...
dpkg: dependency problems prevent configuration of master-pdf-editor:
 master-pdf-editor depends on libqt4-svg (>= 4.6.4); however:
  Package libqt4-svg is not installed.
 master-pdf-editor depends on libqt4-network (>= 4.6.4); however:
  Package libqt4-network is not installed.
 master-pdf-editor depends on libqtcore4 (>= 4.6.4); however:
  Package libqtcore4 is not installed.
 master-pdf-editor depends on libqtgui4 (>= 4.8.4); however:
  Package libqtgui4 is not installed.
dpkg: error processing package master-pdf-editor (--install):
 dependency problems - leaving unconfigured
Processing triggers for hicolor-icon-theme (0.17-2) ...
Processing triggers for desktop-file-utils (0.26-1) ...
Processing triggers for mailcap (3.69) ...
Errors were encountered while processing:
 master-pdf-editor

$ sudo apt install libqt4-svg
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Package libqt4-svg is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
E: Package 'libqt4-svg' has no installation candidate

$ sudo apt install libqt4-network
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Package libqt4-network is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
E: Package 'libqt4-network' has no installation candidate

$ sudo apt install libqtcore4
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Package libqtcore4 is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
However the following packages replace it:
  qtchooser libqt5core5a
E: Package 'libqtcore4' has no installation candidate

$ sudo apt install libqtgui4
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Package libqtgui4 is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
E: Package 'libqtgui4' has no installation candidate
May 10, 2023 at 7:31 am #106395MemberXunzi_23
::You will need an older version of antiX to provide the needed qt version.
Sorry, I wanted to provide the link below yesterday.
https://www.linuxuprising.com/2019/04/download-master-pdf-editor-4-for-linux.html
Hope PPC can give you more advice. Login from crazily hot Bangkok is very difficult.
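For anyone who wants to check in advance whether a downloaded .deb can be satisfied by the current release, a rough sketch follows. It reads the package's `Depends` field with `dpkg-deb -f` and asks `apt-cache policy` whether each dependency has an install candidate. The helper names `dep_names` and `check_deb_deps` are invented for this example, and the parsing is deliberately simple (for alternatives such as `a | b` only the first name is checked).

```shell
#!/bin/sh
# Sketch only: dep_names and check_deb_deps are illustrative helper names.

# Read a Depends field on stdin, print one package name per line.
# Version constraints in parentheses are dropped; for "a | b" only "a" is kept.
dep_names() {
    tr ',' '\n' |
    sed -e 's/([^)]*)//g' -e 's/|.*//' \
        -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' |
    grep -v '^$'
}

# Report, for each dependency of a .deb, whether apt has an install candidate.
#   $1 = path to the .deb file
check_deb_deps() {
    dpkg-deb -f "$1" Depends | dep_names |
    while IFS= read -r pkg; do
        # LC_ALL=C keeps the "Candidate:" label in English
        candidate=$(LC_ALL=C apt-cache policy "$pkg" | awk '/Candidate:/ {print $2}')
        case $candidate in
            ''|'(none)') echo "missing:   $pkg" ;;
            *)           echo "available: $pkg ($candidate)" ;;
        esac
    done
}

# Example call (not run automatically):
#   check_deb_deps master-pdf-editor-4.3.89_i386.deb
```

Unlike `dpkg -i`, this only inspects the file, so nothing is left half-installed if dependencies turn out to be missing.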
May 10, 2023 at 10:19 am #106401MemberPPC
::Sorry, but like Xunzi said, that 32-bit .deb file is so old that its dependencies are no longer available in modern Linux versions (which currently support Qt 5, not Qt 4).
On a 32-bit system the best course of action to get “searchable text” from a pdf file would be:
– if there is text in the pdf (and not merely an image that happens to show text, just as it might show a drawing), you can probably search through it using searchmonkey
– if there is no text in the pdf, you first need to perform OCR on it, using the previously mentioned tools, to convert that image to real text, which can then be saved in a text document. There are also tools to extract images from the .pdf file. You can then perform OCR on those images.
32-bit PCs are literally a dying breed: they are just too old, there’s only a limited number of things they can do, and their compatibility with modern software is slim. Thankfully, antiX has a 32-bit version, as do Firefox, firefox-esr, seamonkey, and also libreoffice and openoffice (openoffice is much faster and uses far fewer resources on a 32-bit system; I run it from the appimage provided by a very ingenious forum user, and enable its quickstarter, which makes office documents load almost instantly on my 20-year-old laptop). My point: one has to adapt to the tools the arch has available… Fortunately we can still run modern web browsers, office suites, e-mail clients, media players, pdf and ebook readers, music players, etc… Most of that runs, runs well, and is even fast most of the time. What does not run… simply does not run 🙁
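The two cases described here (the pdf already contains text, or it is only an image) can be told apart from the terminal. The sketch below is one possible way, not a definitive recipe: it runs `pdftotext` and checks whether anything other than whitespace comes out. `pdftotext` writes a form feed character (U+000C) at the end of each page, so output made up only of "000C" squares, as reported earlier in this thread, would be consistent with a file that has no text layer at all. The function names are made up for this example, and `ell` (Greek) as the default OCR language is an assumption.

```shell
#!/bin/sh
# Sketch only: has_real_text, pdf_has_text and make_searchable are
# illustrative helper names.

# Succeeds if stdin holds at least one non-whitespace character.
# Form feeds (U+000C) count as whitespace and are discarded.
has_real_text() {
    tr -d '[:space:]' | grep -q .
}

# Succeeds if pdftotext can extract real text from the PDF given as $1.
pdf_has_text() {
    pdftotext "$1" - 2>/dev/null | has_real_text
}

# Run OCR only when the PDF has no usable text layer.
#   $1 = input PDF, $2 = Tesseract language code (default: ell)
make_searchable() {
    if pdf_has_text "$1"; then
        echo "text layer found, nothing to do: $1"
    else
        ocrmypdf -l "${2:-ell}" "$1" "${1%.pdf}_ocr.pdf"
    fi
}

# Example call (not run automatically):
#   make_searchable scan.pdf ell
```

Checking first avoids running OCR on documents that are already searchable, which matters on an older 32-bit machine where OCR can be slow.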
P.
May 10, 2023 at 1:46 pm #106418MemberXunzi_23
::As I am pragmatic, I would use an old Linux version to work on the pdf document.
As long as the internet is not involved there is nothing to worry about, unless the document is
so weird you are unable to open it with masterpdfeditor. Version 4.3 is free to
use on Linux. Later versions require a purchase to unlock features and to print without
a watermark.
May 11, 2023 at 7:01 am #106482Memberandfree
::Thanks again for all the replies.
Have you made sure you have set the proper language in your target document in libre-office before pasting?
I tried Tools -> Language -> For Selection -> Greek, then I pasted again, but it didn’t work.
openoffice is much faster and uses far fewer resources on a 32-bit system; I run it from the appimage provided by a very ingenious forum user, and enable its quickstarter, which makes office documents load almost instantly on my 20-year-old laptop
Is this appimage available for download?
I would use an old Linux version to work on the pdf document.
Any idea which version could be compatible?
