Extracting Text from PDFs in Python with PyMuPDF (fitz)

When working with PDFs, a common task is extracting text for further processing or analysis, or for converting the content into a more accessible format. PyMuPDF, a third-party Python library that is imported under the name fitz, makes it straightforward to extract text from PDF files.

In this post, we’ll walk through a simple Python script that extracts text from each page of a PDF file and saves it to individual text files.

Prerequisites

Before we begin, you’ll need to have PyMuPDF installed. If you haven’t installed it yet, you can do so using pip:

pip install pymupdf
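
To confirm that the installation worked, one option is to ask Python's packaging metadata for the installed version. The following is a minimal sketch; the helper name pymupdf_version is hypothetical and not part of PyMuPDF, and it assumes the package was installed under the distribution name PyMuPDF.

```python
from importlib.metadata import PackageNotFoundError, version


def pymupdf_version():
    """Return the installed PyMuPDF version string, or None if it is missing."""
    try:
        # Looks up the installed distribution by name; does not import fitz.
        return version("PyMuPDF")
    except PackageNotFoundError:
        return None


print(pymupdf_version())
```

If this prints None, the package is not installed in the environment that is running the script.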

The Script

Below is a Python script that takes a PDF file as input and extracts the text from each page, saving it to separate .txt files:

import fitz  # PyMuPDF
import os

def extract_text_from_pdf(pdf_path, output_dir):
    # Open the PDF file
    pdf_document = fitz.open(pdf_path)
    
    # Create the output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Extract text from each page
    for page_num in range(len(pdf_document)):
        page = pdf_document.load_page(page_num)
        text = page.get_text()
        
        # Define the filename for each page
        txt_file_path = os.path.join(output_dir, f"page_{page_num + 1}.txt")

        # Save the extracted text to a file
        with open(txt_file_path, 'w', encoding='utf-8') as file:
            file.write(text)

        print(f"Text from page {page_num + 1} saved to {txt_file_path}")

    # Close the PDF file after processing
    pdf_document.close()

# Replace with the path to your PDF file and the desired output directory
if __name__ == "__main__":
    extract_text_from_pdf("data/file.pdf", "output")

How the Script Works

  1. Opening the PDF: The script begins by opening the PDF file using fitz.open(pdf_path). This function loads the PDF and prepares it for processing.

  2. Creating the Output Directory: The script makes sure the output directory exists by calling os.makedirs(output_dir, exist_ok=True). This creates the directory, along with any missing parent directories, and does nothing if it already exists.

  3. Extracting Text: The script loops over the pages by index, loads each one with pdf_document.load_page(page_num), and extracts its plain text with page.get_text().

  4. Saving the Text: Each page's text is written to its own .txt file, named after the page number (page_1.txt, page_2.txt, and so on). The file is opened with UTF-8 encoding so that non-ASCII characters are written correctly.

  5. Closing the PDF: After processing all the pages, the script closes the PDF file with pdf_document.close() to free up resources.

Customization

You can easily customize this script to suit your needs. For example, you could modify the output format, process only specific pages, or extract additional information such as images or annotations.

Conclusion

With just a few lines of code, you can leverage the power of PyMuPDF (fitz) to extract text from PDFs in Python. This script is a great starting point for any project that involves working with PDF documents.

Tags:
Python
