ExtractFilePageText

Extraction, Page manipulation

Description

Extracts the text from a single page of a PDF file.

This function internally uses the direct access functionality. The entire file is not loaded into memory, so this function can be used on arbitrarily large documents.

Two text extraction algorithms are provided: a standard algorithm and a more accurate one. Each can return the text of the selected page in several output formats, chosen with the Options parameter.

The DASetTextExtractionWordGap, DASetTextExtractionOptions and DASetTextExtractionArea functions can be used to adjust the text extraction process.

Syntax

Delphi

function TQuickPDF0813.ExtractFilePageText(InputFileName, Password: WideString; 
  Page, Options: Integer): WideString;

ActiveX

Function QuickPDF0813.PDFLibrary::ExtractFilePageText(InputFileName As String,
  Password As String, Page As Long, Options As Long) As String

DLL

wchar_t * QuickPDFExtractFilePageText(int InstanceID, wchar_t * InputFileName,
  wchar_t * Password, int Page, int Options)

Parameters

InputFileName The path and file name of the file to extract text from.
Password The password to use, if any, when opening the file.
Page The number of the page to extract. The first page in the document is page 1.
Options Using the standard text extraction algorithm:
0 = Extract text in human readable format
1 = Deprecated
2 = Return a CSV string including font, color, size and position of each piece of text on the page
Using the more accurate text extraction algorithm:
3 = Return a CSV string for each piece of text on the page with the following format:
Font Name, Text Color, Text Size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, Text
The co-ordinates are the four points bounding the text, measured in points (1/72 inch) with the bottom-left corner of the page as the origin. Co-ordinate order is anti-clockwise with the bottom left corner first.
4 = Similar to option 3, but individual words are returned, making searching for words easier
5 = Similar to option 3 but character widths are output after each line
6 = Similar to option 4 but character widths are output after each line
7 = Extract text in human readable format with improved accuracy compared to option 0
8 = Similar to option 7 but without layout formatting
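
The CSV layout described for option 3 can be post-processed in any language once the function has returned its string. The following Python sketch is illustrative only and is not part of the library. It assumes the returned string holds one record per line, that the font name and text fields are quoted in the usual CSV manner, and that the color field can be treated as an opaque string. The sample row, the TextPiece class and the parse_option3 function are hypothetical names and data invented for this example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class TextPiece:
    font: str
    color: str      # kept as an opaque string; its exact format is not assumed
    size: float
    corners: list   # four (x, y) tuples in points, anti-clockwise from bottom-left
    text: str

    def bounds(self):
        """Return the axis-aligned box (left, bottom, right, top) around the corners."""
        xs = [x for x, _ in self.corners]
        ys = [y for _, y in self.corners]
        return min(xs), min(ys), max(xs), max(ys)


def parse_option3(csv_text):
    """Parse option 3 style output into a list of TextPiece records."""
    pieces = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) < 12:
            continue  # skip blank or incomplete lines
        coords = [float(value) for value in row[3:11]]
        corners = list(zip(coords[0::2], coords[1::2]))
        # If the text held unquoted commas it was split into extra fields;
        # rejoin them (approximately) so nothing is dropped.
        text = ",".join(row[11:])
        pieces.append(TextPiece(row[0], row[1], float(row[2]), corners, text))
    return pieces


# Hypothetical sample row, made up for illustration (not real library output).
SAMPLE = '"Helvetica",#000000,12,72,700,130.5,700,130.5,712,72,712,"Hello, world"\n'

if __name__ == "__main__":
    for piece in parse_option3(SAMPLE):
        print(piece.text, piece.bounds())
```

Because the origin is the bottom-left corner of the page, larger Y values are higher on the page. For text that is not rotated, the box returned by bounds() matches the four corner points; for rotated text it is the smallest upright rectangle that encloses them.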

Copyright © 2011 Debenu. All rights reserved.