ExtractFilePageText
Description
Extracts the text of any page in a PDF file.
This function internally uses the direct access functionality. The entire file is not loaded into memory, so this function can be used on arbitrarily large documents.
Two text extraction algorithms are provided, a standard one and a more accurate one. Each offers several output formats, selected with the Options parameter.
The DASetTextExtractionWordGap, DASetTextExtractionOptions and DASetTextExtractionArea functions can be used to adjust the text extraction process.
Syntax
Delphi
function TQuickPDF0813.ExtractFilePageText(InputFileName, Password: WideString;
Page, Options: Integer): WideString;
ActiveX
Function QuickPDF0813.PDFLibrary::ExtractFilePageText(InputFileName As String,
Password As String, Page As Long, Options As Long) As String
DLL
wchar_t * QuickPDFExtractFilePageText(int InstanceID, wchar_t * InputFileName,
wchar_t * Password, int Page, int Options)
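As a minimal sketch of calling the DLL form, the C code below declares a function-pointer type that mirrors the signature above and loops over a range of pages. Page numbering starts at 1, as stated in the Parameters table. The names ExtractFilePageTextFn, count_extracted_chars, stub_extract and the enum constants are hypothetical and are not part of the library. How the pointer is obtained from the DLL and how the InstanceID is created are outside the scope of this page, so a stand-in function is used to keep the sketch runnable. Ownership of the returned buffer is not stated here, so the sketch only reads it.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Pointer type mirroring the DLL signature shown above (type name is ours). */
typedef wchar_t *(*ExtractFilePageTextFn)(int InstanceID, wchar_t *InputFileName,
                                          wchar_t *Password, int Page, int Options);

/* Option values from the Parameters table; the constant names are ours. */
enum {
    EXTRACT_PLAIN          = 0,
    EXTRACT_CSV_ACCURATE   = 3,
    EXTRACT_PLAIN_ACCURATE = 7
};

/*
 * Calls the extractor once per page in [first_page, last_page] (inclusive)
 * and returns the total number of characters returned. Pages are numbered
 * from 1, so a smaller first_page is clamped to 1. Returned buffers are only
 * read, never freed or stored.
 */
size_t count_extracted_chars(ExtractFilePageTextFn extract, int instance_id,
                             wchar_t *file_name, wchar_t *password,
                             int first_page, int last_page, int options)
{
    size_t total = 0;
    int page;

    assert(extract != NULL);
    if (first_page < 1)
        first_page = 1;

    for (page = first_page; page <= last_page; page++) {
        const wchar_t *text = extract(instance_id, file_name, password,
                                      page, options);
        if (text != NULL)
            total += wcslen(text);
    }
    return total;
}

/*
 * Hypothetical stand-in for the real DLL function so the sketch runs without
 * the library: a two-page "document" with fixed text.
 */
wchar_t *stub_extract(int instance_id, wchar_t *file_name, wchar_t *password,
                      int page, int options)
{
    static wchar_t page1[] = L"first page";
    static wchar_t page2[] = L"second";
    static wchar_t none[]  = L"";

    (void)instance_id; (void)file_name; (void)password; (void)options;
    if (page == 1) return page1;
    if (page == 2) return page2;
    return none;
}
```

In real use, stub_extract would be replaced by the pointer to QuickPDFExtractFilePageText resolved from the DLL, and instance_id by a valid library instance.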
Parameters
| InputFileName | The path and file name of the file to extract text from. |
| Password | The password to use, if any, when opening the file. |
| Page | The number of the page to extract text from. The first page in the document is page 1. |
| Options |
Using the standard text extraction algorithm:
0 = Extract text in human readable format
1 = Deprecated
2 = Return a CSV string including font, color, size and position of each piece of text on the page
Using the more accurate text extraction algorithm:
3 = Return a CSV string for each piece of text on the page with the following format:
    Font Name, Text Color, Text Size, X1, Y1, X2, Y2, X3, Y3, X4, Y4, Text
    The co-ordinates are the four points bounding the text, measured in points (1/72 inch) with the bottom-left corner of the page as the origin. Co-ordinate order is anti-clockwise with the bottom-left corner first.
4 = Similar to option 3, but individual words are returned, making searching for words easier
5 = Similar to option 3 but character widths are output after each line
6 = Similar to option 4 but character widths are output after each line
7 = Extract text in human readable format with improved accuracy compared to option 0
8 = Similar to option 7 but without layout formatting |
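The option 3 record layout can be illustrated with a small parser. The C sketch below splits one record into its twelve fields and derives an axis-aligned bounding box from the four corner points, which also covers rotated text. It rests on assumptions not confirmed by this page: fields are separated by commas, a field may be wrapped in double quotes with "" standing for a literal quote, text containing commas is quoted, and numbers use a period as the decimal separator. The color is kept as a raw string because its format is not specified here. TextRecord, next_field, next_number, parse_text_record and bounding_box are hypothetical names, not library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#define FIELD_MAX 256

/* One option 3 record: font, color, size, four corner points, text. */
typedef struct {
    wchar_t font[FIELD_MAX];
    wchar_t color[FIELD_MAX];   /* kept as raw text; format is an unknown */
    double  size;
    double  x[4], y[4];         /* anti-clockwise, bottom-left corner first */
    wchar_t text[FIELD_MAX];
} TextRecord;

/*
 * Copies the next CSV field into out (truncating to cap - 1 characters).
 * Assumes optional double quotes with "" as an escaped quote.
 * Sets *cursor to NULL once the last field has been consumed.
 * Returns 1 if a field was read, 0 if the input was already exhausted.
 */
static int next_field(const wchar_t **cursor, wchar_t *out, size_t cap)
{
    const wchar_t *s = *cursor;
    size_t n = 0;
    int quoted;

    if (s == NULL)
        return 0;
    while (*s == L' ')
        s++;
    quoted = (*s == L'"');
    if (quoted)
        s++;

    while (*s != L'\0') {
        if (quoted && *s == L'"') {
            if (s[1] == L'"') {
                s++;                 /* "" : skip one, copy the other below */
            } else {
                s++;                 /* closing quote */
                quoted = 0;
                continue;
            }
        } else if (!quoted && *s == L',') {
            break;                   /* field separator */
        }
        if (n + 1 < cap)
            out[n++] = *s;
        s++;
    }
    out[n] = L'\0';
    *cursor = (*s == L',') ? s + 1 : NULL;
    return 1;
}

/* Reads the next field as a number; returns 0 if missing or not numeric. */
static int next_number(const wchar_t **cursor, double *value)
{
    wchar_t buf[64];
    wchar_t *end;

    if (!next_field(cursor, buf, 64))
        return 0;
    *value = wcstod(buf, &end);
    return end != buf;
}

/* Parses one record; returns 1 on success, 0 if a field is missing or bad. */
int parse_text_record(const wchar_t *line, TextRecord *rec)
{
    const wchar_t *cursor = line;
    int i;

    assert(rec != NULL);
    if (line == NULL)
        return 0;
    if (!next_field(&cursor, rec->font, FIELD_MAX)) return 0;
    if (!next_field(&cursor, rec->color, FIELD_MAX)) return 0;
    if (!next_number(&cursor, &rec->size)) return 0;
    for (i = 0; i < 4; i++) {
        if (!next_number(&cursor, &rec->x[i])) return 0;
        if (!next_number(&cursor, &rec->y[i])) return 0;
    }
    return next_field(&cursor, rec->text, FIELD_MAX);
}

/* Axis-aligned box (in points, bottom-left origin) enclosing the 4 corners. */
void bounding_box(const TextRecord *rec, double *left, double *bottom,
                  double *right, double *top)
{
    int i;

    *left = *right = rec->x[0];
    *bottom = *top = rec->y[0];
    for (i = 1; i < 4; i++) {
        if (rec->x[i] < *left)   *left   = rec->x[i];
        if (rec->x[i] > *right)  *right  = rec->x[i];
        if (rec->y[i] < *bottom) *bottom = rec->y[i];
        if (rec->y[i] > *top)    *top    = rec->y[i];
    }
}
```

Because the co-ordinates use the bottom-left corner of the page as the origin, the top edge of the box has the larger Y value. If the actual output uses a different quoting or separator convention, only next_field needs to change.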