Instantly Extract Hyperlinks From PDF Files & Export Them to PDF, DOC, DOCX
Are you looking for a way to preserve the links in your PDF files, or to export the hyperlinks from a PDF file to a text file for future use?
A Portable Document Format (PDF) file is a popular format for sharing information, reports, and official or legal documents. Sometimes these PDF files contain hyperlinked text or URLs, and you may want to extract all of those URLs to keep them for future reference.
In this blog, I am going to describe the working of a tool designed by SysTools that extracts hyperlinks from PDF files and saves them in a PDF, DOC, or DOCX file. We will also see how to use the Python language to extract URLs from a PDF.
Extract Hyperlinks From PDF Files Using the Python PyPDF2 Library
Step-1: Install PyPDF2 on your local system by typing pip install PyPDF2 in the command shell.
Step-2: Import PyPDF2.
Step-3: Open the PDF in binary mode so that PyPDF2 can read its contents.
Step-4: Define a function to extract the hyperlink for a particular PDF page.
Step-5: Iterate over all the pages and extract the text of each page using the extractText() function.
Step-6: To pick the hyperlinks out of the extracted text, use pattern matching in Python. Import the re module to search for the pattern with regular expressions.
Step-7: Find every string that starts with http:// or https:// using findall(regex, string).
Step-8: If any URL is found, print it on the screen.
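Steps 6 to 8 can be tried on a plain string before any PDF is involved. The following is a minimal sketch, assuming the same pattern the article uses; the helper name find_urls and the sample text are made up for illustration.

```python
import re

# Same pattern as in the article: http:// or https:// followed by non-whitespace
URL_PATTERN = r"(https?://\S+)"


def find_urls(text):
    """Return every substring of text that looks like an http(s) URL."""
    return re.findall(URL_PATTERN, text)


# Hypothetical sample text standing in for the text of one PDF page
sample = "Docs at https://example.com/guide and mirror at http://example.org/a?b=1 for reference"

for url in find_urls(sample):
    print(url)
```

Because the pattern stops only at whitespace, punctuation that directly follows a URL in running text ends up inside the match.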
Here is the Python Code to Extract Links From PDF File
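# Importing packages
import re
import PyPDF2

# Pattern that matches any string starting with http:// or https://
regex = r"(https?://\S+)"

def find_urls(string):
    # Find all the strings that match the pattern
    return re.findall(regex, string)

# Open your file in binary mode
# Note: PdfFileReader, numPages, getPage and extractText are the PyPDF2 1.x/2.x names;
# PyPDF2 3.x uses PdfReader, len(reader.pages), reader.pages[i] and extract_text instead
file = open("newfile.pdf", "rb")
readPDF = PyPDF2.PdfFileReader(file)

# Iterating over all the pages of the file
for page_no in range(readPDF.numPages):
    page = readPDF.getPage(page_no)
    # Extract the text from the page
    text = page.extractText()
    # Print all URLs found on this page
    for url in find_urls(text):
        print(url)

# Close the file
file.close()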
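The script prints each match exactly as the regular expression captured it, so a trailing comma or full stop stays attached and repeated links are printed again. If you want to keep the links in a text file for later use, a small post-processing step can help. This is only a sketch under stated assumptions: clean_urls and save_urls are hypothetical helper names, and the set of trailing characters to trim is a guess that can also cut a character from a URL that legitimately ends in one of them.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

# Characters that often follow a URL in running text (an assumption, adjust as needed)
TRAILING = ".,;:!?)]}>\"'"


def clean_urls(text):
    """Find URLs in text, trim trailing punctuation, and drop duplicates (order kept)."""
    seen = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(TRAILING)
        if url and url not in seen:
            seen.append(url)
    return seen


def save_urls(urls, path):
    """Write one URL per line to a UTF-8 text file and return how many were written."""
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")
    return len(urls)
```

For example, you could call clean_urls on the text of each page, collect the results in one list, and pass that list to save_urls together with a file name such as links.txt.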
The above method can be too technical for some users, so to make the task easier you can use the automated solution instead.
How to Automatically Extract and Export Hyperlinks From PDF Files
Since it is an automated tool with a well-defined interface, you do not need expertise or technical knowledge to run the software.
Step-1: Download the SysTools PDF Extractor on your system.
Step-2: Click on “Add File(s)/ Add Folder” to browse for PDF documents from your system. You can change the saving location of PDFs as well using “Change”. Click on “Next”.
Step-3: Here, you have to choose the “Hyperlinks” option to extract links from PDF.
Step-4: To export hyperlinks from PDF, the tool offers three file formats (PDF, DOC, DOCX) in which you can save the extracted URLs. Select any one of them.
Step-5: If needed, use the page settings to specify the PDF pages from which you want to extract hyperlinks.
Step-6: Finally, click on the “Extract” button.
Other Prominent Features of This Automated PDF Utility
Other than hyperlinks, the software is capable of extracting different kinds of objects from PDF files. You can extract the following PDF objects:
- Inline/ Embedded Images
- Attached files or Portfolio
- PDF Bookmarks
- Comments and Highlighted Text
- Rich Media
The tool does not need the owner (permissions) password to process the PDF files. Also, note that the original formatting of your PDF files is not changed.
In this article, two methods have been explained to extract links from PDF files: Python programming and the automated PDF link extractor tool by SysTools. Both methods have their advantages. Using Python is free but can be difficult for a non-technical user. The automated tool is recommended for professionals and for anyone working with a large number of PDF files. You can try the free version of the tool, which extracts a limited number of URLs from a PDF.