
Foxit Quick PDF Library

Frequently Asked Question:


How do I retrieve all URLs and related text from a PDF?

Question

How do I retrieve all URLs and related text from a PDF?

Answer

Unfortunately, because of the way PDFs work, the visible text of a link and the hotspot itself are not related in any way.

Text and graphics can sit anywhere on the page, while the hotspot links are annotations that float on a layer above the page content.

Getting all the links is therefore a two-step process, and you will need to write some logic of your own to correlate the two sets of information and match each URL to the text of its link.

Here's the procedure to use:

Step 1. Get the URLs and locations of the hotspot links

QP.LoadFromFile(...)
QP.SelectPage(...)
// Loop over every annotation on the selected page
For X = 1 to QP.AnnotationCount
    URL = QP.GetAnnotStrProperty(X, 111)     // 111: URL of the link
    Left = QP.GetAnnotDblProperty(X, 105)    // 105 to 108: hotspot rectangle
    Top = QP.GetAnnotDblProperty(X, 106)
    Width = QP.GetAnnotDblProperty(X, 107)
    Height = QP.GetAnnotDblProperty(X, 108)
    // Store this information in an array
Next X

Step 2. Get the location of blocks of text on the page

QP.SelectPage(...)
PageText = QP.GetPageText(3)
// Split PageText into rows (one row per block of text)
// Each row holds these fields, in this order:
//   Font Name, Text Color, Text Size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, Text
// Store each parsed row in a second array
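
The row-splitting part of Step 2 could look roughly like the Python sketch below. It is only an illustration under assumptions: it assumes the GetPageText(3) output has already been captured as a string, that each row is comma-separated with the twelve fields in the order listed above, and that text containing commas is quoted. The names TextBlock and parse_page_text, and the sample rows, are hypothetical and not part of the library.

```python
import csv
import io
from typing import NamedTuple


class TextBlock(NamedTuple):
    font: str
    color: str
    size: float
    corners: list  # four (x, y) tuples: (X1, Y1) .. (X4, Y4)
    text: str


def parse_page_text(page_text: str) -> list:
    """Split captured page text into TextBlock records (assumed CSV layout)."""
    blocks = []
    for row in csv.reader(io.StringIO(page_text, newline="")):
        if len(row) < 12:
            continue  # skip blank or malformed rows
        numbers = [float(value) for value in row[3:11]]
        corners = list(zip(numbers[0::2], numbers[1::2]))
        # Re-join any extra fields in case the text held unquoted commas.
        text = ",".join(row[11:])
        blocks.append(TextBlock(row[0], row[1], float(row[2]), corners, text))
    return blocks


# Hypothetical sample rows, made up for this example.
sample = (
    '"Helvetica",#0000FF,12,72,700,160,700,160,712,72,712,"Visit our site"\r\n'
    '"Helvetica",#000000,12,72,680,200,680,200,692,72,692,"Plain, unlinked text"\r\n'
)

for block in parse_page_text(sample):
    print(block.text, block.corners)
```

Using a CSV parser rather than a plain split on commas keeps quoted text that itself contains commas in one piece.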

Step 3. Compare the information in the two arrays to match URLs with blocks of text

A good approach might be to expand the rectangle of the hotspot link by a certain percentage and then check whether the four corner points (X1, Y1) to (X4, Y4) of each text block fall inside the expanded rectangle.
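
As a rough illustration of that comparison, here is a minimal Python sketch. It assumes the hotspot values (Left, Top, Width, Height) and the text block corners are expressed in the same coordinate system, and by default that Y increases upward so the rectangle extends downward from Top; pass y_up=False if your origin is at the top of the page. The function names and the 10 percent margin are hypothetical choices, not library features.

```python
def expand_rect(left, top, width, height, percent=10.0, y_up=True):
    """Return (x_min, y_min, x_max, y_max) grown by `percent` on every side."""
    x_min, x_max = left, left + width
    if y_up:
        # Origin at the bottom of the page: Top is the largest Y value.
        y_min, y_max = top - height, top
    else:
        # Origin at the top of the page: Top is the smallest Y value.
        y_min, y_max = top, top + height
    dx = width * percent / 100.0
    dy = height * percent / 100.0
    return (x_min - dx, y_min - dy, x_max + dx, y_max + dy)


def corners_inside(corners, rect):
    """True if every (x, y) corner lies within the rectangle."""
    x_min, y_min, x_max, y_max = rect
    return all(x_min <= x <= x_max and y_min <= y <= y_max for x, y in corners)


# Made-up values: one hotspot and two text blocks.
rect = expand_rect(left=70, top=714, width=92, height=16)
link_text = [(72, 700), (160, 700), (160, 712), (72, 712)]
other_text = [(72, 680), (200, 680), (200, 692), (72, 692)]

print(corners_inside(link_text, rect))   # True
print(corners_inside(other_text, rect))  # False
```

Requiring all four corners to be inside is the strict form of the test. If text often spills slightly past the hotspot, a larger margin or a test on the block's center point may match better.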

There is no guarantee that the words of the link text will be returned as a single block, so several rows of GetPageText output may fall within one hotspot rectangle. Those blocks may also not be returned in "visual" order.
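
One possible way to deal with both issues is to collect every block that matched a hotspot, group the blocks into lines by their Y position, and sort each line from left to right before joining the text. The Python sketch below shows the idea. It assumes Y increases upward and that each matched block is reduced to a hypothetical (x, y, text) tuple taken from its first corner; the function name and the tolerance value are illustrative only.

```python
def join_in_reading_order(blocks, line_tolerance=2.0):
    """Join (x, y, text) blocks top to bottom, then left to right.

    Blocks whose y values differ by no more than `line_tolerance`
    from the first block of a line are treated as the same line.
    """
    lines = []  # each entry: (y of the line, [(x, text), ...])
    # Visit blocks from the top of the page down (largest y first).
    for x, y, text in sorted(blocks, key=lambda b: -b[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    # Within each line, order the pieces by x before joining.
    return " ".join(text for _, items in lines for _, text in sorted(items))


# Made-up blocks that arrive out of visual order.
matched = [
    (120.0, 700.0, "PDF"),
    (72.0, 700.5, "Quick"),
    (72.0, 686.0, "Library"),
]
print(join_in_reading_order(matched))  # Quick PDF Library
```

The tolerance absorbs small baseline differences between blocks on the same line; it would need tuning for the font sizes in your documents.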


© 2015 Debenu & Foxit. All rights reserved.