Read PDF Content using Selenium WebDriver

While automating with Selenium WebDriver, you may encounter a scenario where you need to read and verify PDF content. In this post, we will see a simple way to do that.

Selenium WebDriver has no built-in feature for reading PDF content, so we need a third-party library for it.

Apache PDFBox is an open-source Java library for working with PDF documents, and it can extract their text. With PDFBox, reading PDF content takes only a few lines of code.

In the example below, we have created a custom function, readPDFContent(), that you can reuse in your own project. It takes the URL of the PDF file as a String parameter and returns the extracted text.


Pre-requisite

1. Download the Apache PDFBox JAR from the Apache PDFBox website. The script below is written against the PDFBox 2.x API (PDDocument.load).

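2. Add the Selenium Standalone JAR and the PDFBox JAR to the build path of your Java project. The script also uses TestNG annotations and assertions, so TestNG must be on the build path as well.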

Let’s automate the following scenario to read the PDF content:

Scenario :

1. Launch Chrome Browser and Open URL : http://www.pdf995.com/samples/pdf.pdf

2. Read PDF Content and store it into a String variable.

3. Verify the content.

4. Close Browser.

Selenium Script :

import java.io.BufferedInputStream;
import java.io.InputStream;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.testng.Assert;
import org.testng.annotations.AfterTest;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.Test;

public class PDFReader {

	WebDriver driver;

	@BeforeTest
	public void setUp() {
		System.setProperty("webdriver.chrome.driver", "C:\\gridsetup\\chromedriver.exe");
		driver = new ChromeDriver();
	}

	@Test
	public void verifyPDFContent() throws Exception {
		String url = "http://www.pdf995.com/samples/pdf.pdf";
		// Launch Chrome Browser and Open URL
		driver.get(url);
		// Read PDF Content and store it into a String variable.
		String pdfContent = readPDFContent(driver.getCurrentUrl());
		// Verify the content.
		Assert.assertTrue(
				pdfContent.contains("Pdf995 makes it easy and affordable to create professional-quality documents"));
	}

	// Close Browser. Running this after the test closes it even when the assertion fails.
	@AfterTest
	public void tearDown() {
		if (driver != null) {
			driver.quit();
		}
	}

	// Downloads the PDF at the given URL and returns its text content.
	public String readPDFContent(String appUrl) throws Exception {
		URL url = new URL(appUrl);
		// try-with-resources closes the document and both streams, in reverse order,
		// even if loading or text extraction throws an exception.
		try (InputStream is = url.openStream();
				BufferedInputStream fileToParse = new BufferedInputStream(is);
				// PDDocument.load(InputStream) is the PDFBox 2.x API.
				PDDocument document = PDDocument.load(fileToParse)) {
			String output = new PDFTextStripper().getText(document);
			// Print the extracted text to the console.
			System.out.println(output);
			return output;
		}
	}

}
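
readPDFContent() hands whatever the URL returns straight to PDFBox. If the server answers with something else, such as an HTML error page or a login page, loading fails with a parsing error that can be hard to interpret. As a rough sketch, the stream could first be checked for the "%PDF-" marker that a well-formed PDF file starts with. The class name PdfSignatureCheck and the method looksLikePdf() are hypothetical helpers introduced here for illustration; they are not part of Selenium or PDFBox.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Selenium or PDFBox.
public class PdfSignatureCheck {

	// A well-formed PDF file starts with the ASCII marker "%PDF-".
	private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

	// Peeks at the first bytes of the stream and then rewinds it, so the
	// caller can still parse the stream from the beginning afterwards.
	// The stream must support mark/reset, which BufferedInputStream does.
	public static boolean looksLikePdf(InputStream in) throws IOException {
		if (!in.markSupported()) {
			throw new IllegalArgumentException("Stream must support mark/reset");
		}
		in.mark(PDF_MAGIC.length);
		byte[] header = new byte[PDF_MAGIC.length];
		int total = 0;
		// read() may return fewer bytes than requested, so loop until the
		// header is full or the stream ends.
		while (total < header.length) {
			int n = in.read(header, total, header.length - total);
			if (n < 0) {
				break;
			}
			total += n;
		}
		in.reset();
		return total == header.length && Arrays.equals(header, PDF_MAGIC);
	}
}
```

One option is to call looksLikePdf(fileToParse) inside readPDFContent(), just before the document is loaded, and to fail the test with a clear message when it returns false.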

The Selenium script also prints the extracted PDF content to the Eclipse console, through the System.out.println(output) call in readPDFContent().
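
The extracted text keeps the line breaks of the PDF layout, so a phrase that wraps across two lines in the document will not match a plain contains() check. A minimal sketch of one way to make the verification step tolerant of this is to collapse whitespace on both sides before comparing. The class name PdfTextMatcher and its methods are hypothetical helpers, not part of PDFBox or TestNG.

```java
// Hypothetical helper, not part of PDFBox or TestNG.
public class PdfTextMatcher {

	// Collapses every run of whitespace (spaces, tabs, line breaks)
	// into a single space and trims both ends.
	public static String normalizeWhitespace(String text) {
		return text.replaceAll("\\s+", " ").trim();
	}

	// Returns true if the expected phrase occurs in the extracted text,
	// ignoring differences in spacing and line wrapping.
	public static boolean containsIgnoringWhitespace(String extracted, String expected) {
		return normalizeWhitespace(extracted).contains(normalizeWhitespace(expected));
	}
}
```

In the test, the assertion could then read Assert.assertTrue(PdfTextMatcher.containsIgnoringWhitespace(pdfContent, "...")). This sketch does not handle words that the PDF hyphenates across lines.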
