Extracting text with iTextSharp throws an InvalidCastException

If you use the 'iTextSharp' library to read or write PDF files in your application and have hit an InvalidCastException with the message "Unable to cast object of type 'iTextSharp.text.pdf.PdfLiteral' to type 'iTextSharp.text.pdf.PdfString'", this post explains the root cause of the issue.

 


'iTextSharp' is a very popular third-party library for reading and writing PDF documents from C#. It is the .NET port of the Java library iText. If you already use the library in your application, you know how powerful it is.

 

While reading some PDF documents, the library may throw an InvalidCastException with the message "Unable to cast object of type 'iTextSharp.text.pdf.PdfLiteral' to type 'iTextSharp.text.pdf.PdfString'" and the following stack trace:


  Message :Unable to cast object of type 'iTextSharp.text.pdf.PdfLiteral' 
           to type 'iTextSharp.text.pdf.PdfString'.
  Source :itextsharp
  Stack Trace :   at iTextSharp.text.pdf.DocumentFont.FillMetrics(Byte[] touni, IntHashtable widths, Int32 dw)
     at iTextSharp.text.pdf.DocumentFont.ProcessType0(PdfDictionary font)
     at iTextSharp.text.pdf.DocumentFont.Init()
     at iTextSharp.text.pdf.DocumentFont..ctor(PRIndirectReference refFont)
     at iTextSharp.text.pdf.CMapAwareDocumentFont..ctor(PRIndirectReference refFont)
     at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.GetFont(PRIndirectReference ind)
     at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.SetTextFont.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
     at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
     at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
     at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener, IDictionary`2 additionalContentOperators)
     at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)

The library parses the syntax of the PDF file to find specific PDF objects. When it encounters a token that does not fit any of the object types defined in the PDF specification, it wraps that token in a PdfLiteral object.

 

When such a PdfLiteral object turns up where the code expects a PdfString, the cast fails and the iTextSharp library throws the above InvalidCastException. If you check the stack trace, you will notice the path: while handling a set-font operator, iTextSharp.text.pdf.parser.PdfContentStreamProcessor.GetFont(PRIndirectReference ind) constructs the font object, and the exception is raised further down, in DocumentFont.FillMetrics.

 

As discussed on Stack Overflow, the PDF document that threw the exception contains the font 'Calibri' (a subset composite font) with the following ToUnicode map:


  /CIDInit /ProcSet findresource
  begin
  12 dict
  begin
  /CIDSystemInfo <</Ordering (UCS) /Registry (Adobe) /Supplement 0 >> def
  /CMapName /Adobe-Identity-UCS def
  /CMapType 2 def
  1 begincodespacerange
  <0000> <ffffffffffffffff> endcodespacerange
  20 beginbfchar
  <0003> <0020> <0012> <0043> 
  <0018> <0044> <0045> <004e> <0059> <0051> 
  <005e> <0053> <0102> <0061> <0110> <0063> 
  <011a> <0064> <011e> <0065> <015d> <0069> 
  <0175> <006d> <0176> <006e> <017d> <006f> 
  <01ffffff89> <0070> <01ffffff8c> <0072> 
  <01ffffff90> <0073> <01ffffff9a> <0074> <01ffffffb5> <0075> 
  <01ffffffc7> <0079> endbfchar
  100 beginbfchar 
  <01ffffffcc> <007a> endcmap
  CMapName
  currentdict
  /CMap defineresource
  pop
  end
  end
  ý

 

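If you look at the ToUnicode map above, you will find a second 'beginbfchar' section (the line '100 beginbfchar') that has no matching 'endbfchar'. It announces 100 entries, contains a single mapping, and is then terminated by 'endcmap'. While reading that unterminated section, the parser runs into tokens such as 'endcmap', which it represents as PdfLiteral, where it expects a PdfString, and that is what triggers the above exception.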
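
To make the defect easier to spot, here is a minimal sketch that scans the text of a ToUnicode map and reports any 'begin...' section that is never closed by its matching 'end...' operator. It is plain C# and does not use iTextSharp; the class and method names (CMapSectionChecker, FindUnclosedSections) are made up for this illustration, and how you obtain the CMap text from the PDF is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class CMapSectionChecker
{
    // Returns the "begin..." operators that are never closed
    // by their matching "end..." operator.
    public static List<string> FindUnclosedSections(string cmapText)
    {
        var unclosed = new List<string>();
        string open = null; // name of the section currently open, e.g. "bfchar"

        // Only the paired section operators are matched; "begincmap" and
        // "endcmap" are deliberately not part of the pattern.
        var tokens = Regex.Matches(
            cmapText, @"\b(begin|end)(bfchar|bfrange|codespacerange)\b");

        foreach (Match token in tokens)
        {
            bool isBegin = token.Groups[1].Value == "begin";
            string name = token.Groups[2].Value;

            if (isBegin)
            {
                // A new section starts while the previous one is still open.
                if (open != null) unclosed.Add("begin" + open);
                open = name;
            }
            else if (open == name)
            {
                open = null; // section closed properly
            }
        }

        // Text ended (for example with "endcmap") while a section was open.
        if (open != null) unclosed.Add("begin" + open);
        return unclosed;
    }
}
```

For the map shown above, the '20 beginbfchar' section is closed by 'endbfchar', while the '100 beginbfchar' section is still open when the text ends, so it is the one that gets reported.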

 

If you hit this exception in your application, there is nothing wrong with your code; the PDF itself is malformed. You just need to catch the InvalidCastException thrown by the iTextSharp library so that you can skip the affected PDF documents. Hope the post was helpful. Don't forget to ask your questions, if any.
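
One way to handle it is sketched below. The helper is plain C#: it takes the page count and a delegate that extracts one page, catches InvalidCastException per page, and records which pages were skipped. The names SafePageExtraction and ExtractReadablePages are hypothetical, and skipping single pages rather than the whole document is a choice made for this sketch. The comment at the end shows how it could be wired to iTextSharp's PdfReader and PdfTextExtractor.GetTextFromPage, assuming iTextSharp 5.x.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class SafePageExtraction
{
    // Calls extractPage for pages 1..pageCount and returns the text of every
    // page that could be read. Pages whose extraction throws
    // InvalidCastException are added to skippedPages instead.
    public static Dictionary<int, string> ExtractReadablePages(
        int pageCount, Func<int, string> extractPage, List<int> skippedPages)
    {
        var textByPage = new Dictionary<int, string>();
        for (int page = 1; page <= pageCount; page++)
        {
            try
            {
                textByPage[page] = extractPage(page);
            }
            catch (InvalidCastException)
            {
                // Malformed font data on this page: remember it and move on.
                skippedPages.Add(page);
            }
        }
        return textByPage;
    }
}

// Possible wiring with iTextSharp 5.x (requires the iTextSharp reference,
// "path" is a placeholder for your PDF file):
//
//   var reader = new iTextSharp.text.pdf.PdfReader(path);
//   var skipped = new List<int>();
//   var text = SafePageExtraction.ExtractReadablePages(
//       reader.NumberOfPages,
//       page => iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, page),
//       skipped);
//   reader.Close();
```

Catching InvalidCastException this broadly can also hide unrelated casting bugs inside the delegate, so it is worth logging the skipped pages rather than ignoring them silently.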

 

 




 
© 2008-2017 Kunal-Chowdhury.com - Microsoft Technology Blog for developers and consumers | Designed by Kunal Chowdhury