Extracting text with iTextSharp throws an InvalidCastException


If you use the iTextSharp library to read or write PDF files in your application and have run into an InvalidCastException with the message "Unable to cast object of type 'iTextSharp.text.pdf.PdfLiteral' to type 'iTextSharp.text.pdf.PdfString'", this post explains the root cause of the issue. Article authored by Kunal Chowdhury.


iTextSharp is a popular third-party library for reading and writing PDF documents in C#. A Java version of the library, iText, is also available. If you use the library in your application, you already know how powerful it is.

 

While reading some PDF documents, the library may throw an InvalidCastException with the message "Unable to cast object of type 'iTextSharp.text.pdf.PdfLiteral' to type 'iTextSharp.text.pdf.PdfString'" and the following stack trace:

 

  Message :Unable to cast object of type 'iTextSharp.text.pdf.PdfLiteral' 
           to type 'iTextSharp.text.pdf.PdfString'.
  Source :itextsharp
  Stack Trace :   at iTextSharp.text.pdf.DocumentFont.FillMetrics(Byte[] touni, IntHashtable widths, Int32 dw)
     at iTextSharp.text.pdf.DocumentFont.ProcessType0(PdfDictionary font)
     at iTextSharp.text.pdf.DocumentFont.Init()
     at iTextSharp.text.pdf.DocumentFont..ctor(PRIndirectReference refFont)
     at iTextSharp.text.pdf.CMapAwareDocumentFont..ctor(PRIndirectReference refFont)
     at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.GetFont(PRIndirectReference ind)
     at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.SetTextFont.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
     at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
     at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
     at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener, IDictionary`2 additionalContentOperators)
     at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)

 

 

The library parses the syntax of the PDF file to find specific PDF objects. When it encounters a token that does not fit any of the object types defined in the PDF specification, it creates a PdfLiteral object from it.

 

When the library expects a PdfString but finds one of these unidentifiable PdfLiteral objects, it throws the InvalidCastException shown above. If you check the stack trace, you will notice that the call to iTextSharp.text.pdf.parser.PdfContentStreamProcessor.GetFont(PRIndirectReference ind) started loading the font, and that the cast itself failed further down, in iTextSharp.text.pdf.DocumentFont.FillMetrics, which receives the font's ToUnicode data.

 

As explained on Stack Overflow, the PDF document that threw the exception contains a 'Calibri' font (a subset composite font) with the following ToUnicode map:

 

  /CIDInit /ProcSet findresource
  begin
  12 dict
  begin
  /CIDSystemInfo <</Ordering (UCS) /Registry (Adobe) /Supplement 0 >> def
  /CMapName /Adobe-Identity-UCS def
  /CMapType 2 def
  1 begincodespacerange
  <0000> <ffffffffffffffff> endcodespacerange
  20 beginbfchar
  <0003> <0020> <0012> <0043> 
  <0018> <0044> <0045> <004e> <0059> <0051> 
  <005e> <0053> <0102> <0061> <0110> <0063> 
  <011a> <0064> <011e> <0065> <015d> <0069> 
  <0175> <006d> <0176> <006e> <017d> <006f> 
  <01ffffff89> <0070> <01ffffff8c> <0072> 
  <01ffffff90> <0073> <01ffffff9a> <0074> <01ffffffb5> <0075> 
  <01ffffffc7> <0079> endbfchar
  100 beginbfchar 
  <01ffffffcc> <007a> endcmap
  CMapName
  currentdict
  /CMap defineresource
  pop
  end
  end

 

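If you look at the ToUnicode map above, you will find a second 'beginbfchar' section (the line '100 beginbfchar') that has no matching 'endbfchar'. It ends with 'endcmap' instead. The parser is still expecting character mappings at that point, so it most likely reads 'endcmap' as a PdfLiteral where it wants a PdfString, and that triggers the exception.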
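
One way to spot this kind of damage yourself is to check that every 'beginbfchar' or 'beginbfrange' in a ToUnicode map is closed before the next section starts. The following is only a minimal sketch in C#. It is not part of iTextSharp; the names CMapCheck and FindUnclosedSections are hypothetical, and it assumes you already have the decoded map as plain text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp: reports bfchar/bfrange
// sections in a ToUnicode CMap that are opened but never closed.
public static class CMapCheck
{
    // Matches beginbfchar, endbfchar, beginbfrange and endbfrange as whole words.
    private static readonly Regex SectionToken =
        new Regex(@"\b(begin|end)(bfchar|bfrange)\b");

    public static List<string> FindUnclosedSections(string cmapText)
    {
        var problems = new List<string>();
        string open = ""; // kind of the section currently open, "" if none

        foreach (Match m in SectionToken.Matches(cmapText))
        {
            bool isBegin = m.Groups[1].Value == "begin";
            string kind = m.Groups[2].Value;

            if (isBegin)
            {
                // A new section started while the previous one was still open.
                if (open != "")
                    problems.Add("begin" + open + " without end" + open);
                open = kind;
            }
            else if (open == kind)
            {
                open = ""; // section closed properly
            }
            else
            {
                problems.Add("end" + kind + " without begin" + kind);
            }
        }

        // Text ended (for example at endcmap) with a section still open.
        if (open != "")
            problems.Add("begin" + open + " without end" + open);

        return problems;
    }
}
```

For the map shown above, the first section ('20 beginbfchar' ... 'endbfchar') is balanced, while '100 beginbfchar' is still open when the text ends, so the helper reports a single problem for it.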

 

If you hit this exception in your application, there is nothing wrong with your code. You just need to catch the InvalidCastException thrown by the iTextSharp library so that you can skip the PDF documents that cause it. I hope the post was helpful. Don't forget to ask your questions, if any.
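
As a rough illustration of that handling, here is a minimal sketch in C#. The names SafeExtraction and ExtractAll are hypothetical, and the actual extraction is passed in as a delegate (extractText) so that the skip logic stays independent of any particular iTextSharp version.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: runs a caller-supplied extraction function over a set
// of files and skips the ones that raise an InvalidCastException.
public static class SafeExtraction
{
    public static Dictionary<string, string> ExtractAll(
        IEnumerable<string> pdfPaths,
        Func<string, string> extractText,
        List<string> skipped)
    {
        var results = new Dictionary<string, string>();

        foreach (string path in pdfPaths)
        {
            try
            {
                results[path] = extractText(path);
            }
            catch (InvalidCastException ex)
            {
                // The PDF could not be parsed as expected: record it and move on.
                skipped.Add(path + ": " + ex.Message);
            }
        }

        return results;
    }
}
```

With iTextSharp, extractText would typically open a PdfReader for the path, loop from 1 to reader.NumberOfPages calling PdfTextExtractor.GetTextFromPage(reader, page), and close the reader in a finally block. Only InvalidCastException is caught here; any other exception still propagates to the caller.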

 

 
