Checking text in a PDF file with LoadRunner

A friend was performance testing a new SAP system that had a self-service portal where employees could view and download a PDF version of their payslip. He asked me how he could add a check to his LoadRunner script to make sure that the information in the PDF was correct (employee name, pay grade, etc.). Manually checking all the PDFs at the end of the run was not a scalable solution, and checking only a sample of them could have missed intermittent errors.

Fortunately, it was a simple matter to write some code in VuGen that would solve his problem.

My solution uses the popen() function to call, from LoadRunner, a command-line program that converts the PDF to text. The text can then be verified using simple string comparison functions such as strstr(). The command-line program I used is pdftotext, which is part of the open-source Xpdf project. I have mirrored the file, so you can download pdftotext (from Xpdf v3.03) directly from my website.

Here is the example code:

char *strstr( const char *string1, const char *string2); // Explicit declaration required for C functions that do not return integers.
#define BUFFER_SIZE 16384 // 16 KB. If this gets too big, use malloc() instead to avoid "too many local variables" compilation error.

Action()
{
    int pdf_size; // size of the downloaded PDF (determined at runtime)
    int fp_save; // file pointer for saving the PDF
    int fp_popen; // file/stream pointer for the output from popen()
    int count; // number of characters that have been read from the popen() stream
    char buffer[BUFFER_SIZE]; // allocate memory for the output of popen()
	
    lr_start_transaction("download_pdf");

    // Save entire HTTP response body
    // Note that the PDF download must be in a step by itself in order to save the contents. It cannot be an EXTRARES.
    web_reg_save_param_ex(
        "ParamName=PdfContents",
        "LB/BIN=",
        "RB/BIN=",
        SEARCH_FILTERS,
        "Scope=Body",
        LAST);

    web_reg_find("Text=PDF", LAST); // Basic check
    web_url("Download PDF", 
        "URL=https://www.myloadtest.com/wp-content/uploads/migrated-resources/hp-discover-2012-tb2337-moncrieff.pdf", 
        "Resource=0", // Not a resource 
        "RecContentType=application/pdf", 
        LAST);	

    pdf_size = web_get_int_property(HTTP_INFO_DOWNLOAD_SIZE);
    if (pdf_size < 100000){
        lr_error_message("Downloaded PDF is smaller than expected."); // By itself, this is not a very good check.
        lr_exit(LR_EXIT_ACTION_AND_CONTINUE, LR_FAIL);
    }
	
    lr_end_transaction("download_pdf", LR_AUTO);
    lr_think_time(10);
    lr_start_transaction("check_pdf");
	
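    // Save file. The body length is taken from the saved parameter rather than from pdf_size,
    // because HTTP_INFO_DOWNLOAD_SIZE also counts the response headers.
    {
        char *pdf_data; // pointer to the raw PDF bytes
        unsigned long pdf_len; // length of the PDF body in bytes

        lr_save_string(lr_eval_string("C:\\LoadRunner\\Data\\{TimeStamp}{VuserId}.pdf"), "FileName"); // Ensure that the file name is unique to avoid overwriting the contents
        lr_eval_string_ext("{PdfContents}", strlen("{PdfContents}"), &pdf_data, &pdf_len, 0, 0, -1);
        fp_save = fopen(lr_eval_string("{FileName}"), "wb"); // open file in "write, binary" mode.
        if (!fp_save) {
            lr_error_message("Could not open %s for writing.", lr_eval_string("{FileName}"));
            lr_eval_string_ext_free(&pdf_data);
            lr_exit(LR_EXIT_ACTION_AND_CONTINUE, LR_FAIL);
        }
        fwrite(pdf_data, 1, pdf_len, fp_save);
        fclose(fp_save);
        lr_eval_string_ext_free(&pdf_data);
    }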
	
    // Run "C:\LoadRunner\Lib\pdftotext {FileName} -" and save the output to a parameter.
    fp_popen = popen(lr_eval_string("C:\\LoadRunner\\Lib\\pdftotext {FileName} -"), "r");
    if (!fp_popen) {
        lr_error_message("Could not run pdftotext.");
        lr_exit(LR_EXIT_ACTION_AND_CONTINUE, LR_FAIL);
    }
    count = fread(buffer, sizeof(char), BUFFER_SIZE - 1, fp_popen); // Leave one byte for the null terminator.
    if (feof(fp_popen) == 0) {
        lr_error_message("Did not reach the end of the input stream when reading. Try increasing BUFFER_SIZE.");
        pclose(fp_popen); // Close the stream before bailing out.
        return -1;
    }
    buffer[count] = '\0'; // Make sure the string in the buffer is null-terminated.
    pclose(fp_popen);
    lr_save_string(buffer, "PdfText");
    lr_output_message("Text extracted from the PDF: %s", buffer);
 
    // Check that the PDF parameter contains the expected text ("LoadRunner").
    if (strstr(lr_eval_string("{PdfText}"), "LoadRunner") == NULL) {
        lr_error_message("PDF file does not contain the expected string.");
        lr_exit(LR_EXIT_ACTION_AND_CONTINUE, LR_FAIL);
    }
	
    // Cleanup. Delete the file.
    system(lr_eval_string("del /q {FileName}"));
	
    lr_end_transaction("check_pdf", LR_AUTO);
	
    return 0;
}
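
The BUFFER_SIZE comment in the code above suggests using malloc() once the extracted text no longer fits in a fixed local array. The following is only a rough sketch of that idea in standard C, not tested inside VuGen: read_all() is a hypothetical helper name, and the 4096-byte starting size is an arbitrary assumption. It reads a stream to the end into a heap buffer that doubles whenever it fills up.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read everything from an open stream into a heap
 * buffer that grows as needed. Returns a null-terminated string that the
 * caller must free(), or NULL if memory could not be allocated.
 * If out_len is not NULL, it receives the number of bytes read. */
char *read_all(FILE *stream, size_t *out_len)
{
    size_t capacity = 4096; /* assumed starting size; doubled when full */
    size_t length = 0;
    size_t n;
    char *buffer = malloc(capacity);
    char *bigger;

    if (buffer == NULL) {
        return NULL;
    }

    /* Always keep one byte spare for the null terminator. */
    while ((n = fread(buffer + length, 1, capacity - length - 1, stream)) > 0) {
        length += n;
        if (length == capacity - 1) {
            bigger = realloc(buffer, capacity * 2);
            if (bigger == NULL) {
                free(buffer);
                return NULL;
            }
            buffer = bigger;
            capacity *= 2;
        }
    }

    buffer[length] = '\0';
    if (out_len != NULL) {
        *out_len = length;
    }
    return buffer;
}
```

In the script, the stream returned by popen() would be passed to this helper, the result handed to lr_save_string(), and the buffer released with free(). Because VuGen scripts normally do not include the standard headers, the pointer-returning functions (malloc, realloc) would presumably need explicit declarations like the one for strstr at the top of the script, and the stream would be held in an integer variable as above.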

Two caveats:

The code above has very little error checking/handling. It is very quick and dirty.
pdftotext is a text extraction utility, not an OCR program, so it cannot extract text that is actually an image (like a scanned document).
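
As an illustration of the first caveat, here is a minimal sketch of what error checking around the "save file" step could look like in standard C. The helper name save_bytes() is hypothetical and the code has not been run inside VuGen. It checks the results of fopen(), fwrite() and fclose(), and uses the standard remove() function to delete a partially written file.

```c
#include <stdio.h>

/* Hypothetical helper: write len bytes to path in binary mode.
 * Returns 0 on success and -1 on any failure. */
int save_bytes(const char *path, const void *data, size_t len)
{
    FILE *fp = fopen(path, "wb");
    size_t written;

    if (fp == NULL) {
        return -1; /* e.g. the directory is missing or not writable */
    }

    written = fwrite(data, 1, len, fp);

    /* fclose() can also fail, for example when the disk is full. */
    if (fclose(fp) != 0 || written != len) {
        remove(path); /* do not leave a truncated file behind */
        return -1;
    }
    return 0;
}
```

A script could call lr_error_message() and fail the iteration whenever the helper returns -1. The same remove() call can also replace the system("del /q ...") cleanup step, and it returns 0 when the file was deleted.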

Published On: July 28, 2012
Tags: LoadRunner, migrated

14 Comments

Stuart Moncrieff July 29, 2012 at 4:09 pm

It might be useful to modify the above code to use pdftohtml (http://pdftohtml.sourceforge.net/) instead of pdftotext, as this would allow you to use XPath to extract values from specific parts of the document.
Parthasarathy J September 21, 2012 at 8:48 pm

Nice one, thanks for suggesting this. Isn't it sufficient to do the basic PDF check and compare the downloaded size with the actual one?
- Stuart Moncrieff September 22, 2012 at 11:09 am
  
  In some previous projects I have been lazy and just checked the size of the PDF, rather than the dynamically generated content. It is definitely a much better idea to also do some basic checks of the content (e.g. employee name on payslip).
  
  Now that I have provided the example code, it should only add 5 minutes to script development time to improve the quality of your verification.
  
  You might find some interesting errors. For example, when the system is under load, a small fraction of the PDFs might be generated incorrectly.
debasis June 9, 2013 at 11:44 am

Good one, Stuart. I just fixed one of my issues with this information.

piyush September 19, 2013 at 6:41 pm

Could you please suggest how we would code this with malloc()?

Stuart Moncrieff December 3, 2013 at 10:56 am

TODO: write example code for malloc

Dennis Bosman October 25, 2013 at 8:15 pm

Thanks! Just one question: is it possible to use popen() without the pdftotext you have provided? Love your site, keep it up! Greetings from the Netherlands.
Stuart Moncrieff December 18, 2013 at 10:45 am

Some people have asked how to make sure that the request for the PDF appears in a separate web_url or web_submit_data statement, and not as an EXTRARES.

You can either do it manually (cutting and pasting the URL from your EXTRARES into a web_url statement), or you can get VuGen to automatically generate the PDF request in a separate step. To do this, you must change your recording settings to nominate the “application/pdf” MIME type as a non-resource.

In your VuGen Recording Settings, navigate to HTTP Properties > Advanced > Non-Resources, and add a “non-resource content type” of “application/pdf”.
Anon December 29, 2013 at 6:03 pm

Note that you can delete the file using the remove() function instead of making another system call.
Srini January 17, 2014 at 9:32 am

Hi Stuart,

I have tried setting a non-resource content type of application/pdf, but my recorded script looks the same with and without it. I need to read the content of a PDF file that is loaded within a web page. Any help is appreciated.

web_submit_data("HRHtmlReport.jsp_8",
"Action=https://{host_4443}/hr/modules/com/hyperion/reporting/web/reportViewer/HRHtmlReport.jsp?instanceId={instID}&showprompts=false&showuserpovDone=true&showuserpovFromMenu=false&rnd=0.{rndNumber}&fr_id={fr_id}&viewAs=pdf",
"Method=POST",
"TargetFrame=",
"RecContentType=text/html",
"Referer=https://{host_4443}/hr/modules/com/hyperion/reporting/web/reportViewer/HRHtmlReport.jsp?instanceId={instID}&showprompts=false&showuserpovDone=true&showuserpovFromMenu=false&rnd=0.{rndNumber}&fr_id={fr_id}&viewAs=pdf",
"Snapshot=t81.inf",
"Mode=HTML",
ITEMDATA,
"Name=showCancel", "Value=false", ENDITEM,
"Name=instanceId", "Value={instID}", ENDITEM,
"Name=showuserpovFromMenu", "Value=false", ENDITEM,
"Name=fr_id", "Value={fr_id}", ENDITEM,
"Name=rnd", "Value=0.{rndNumber}", ENDITEM,
"Name=showprompts", "Value=false", ENDITEM,
"Name=showuserpovDone", "Value=true", ENDITEM,
"Name=viewAs", "Value=pdf", ENDITEM,
LAST);
siva February 27, 2014 at 9:47 pm

Good one, Stuart. I created this script, changed the HTTP-request receive timeout, connection timeout and step download timeout, and entered the correct proxy details, but I still get the error message below.
Action.c(18): Error -27796: Failed to connect to server "www.myloadtest.com:80": [10060] Connection timed out [MsgId: MERR-27796]

I am currently working with LoadRunner 11.50.0 on Windows 7.

This error is not specific to this script; I get the same error for many websites.

Can you explain how to fix this error?

Thanks in advance
raju February 28, 2014 at 3:25 pm

When I try to run the above script, it shows the error below.

Action.c(18): Error -27796: Failed to connect to server "www.myloadtest.com:80": [10060] Connection timed out [MsgId: MERR-27796]

Could you please tell me how to fix it?

Thanks in advance.
- priya December 29, 2015 at 5:07 pm
  
  web_set_timeout(CONNECT, "900");
  web_set_timeout(RECEIVE, "900");
  web_set_timeout(STEP, "900");
  Try inserting these functions.
rajesh April 14, 2015 at 9:23 pm

I am new to web services testing in LoadRunner and need some help.

How can we store web service input values and print them in the test results?

Can you please give me some code to check?