Recovering graphics from a broken PDF

This was a Forensics CTF challenge from Cyberthon 2019’s Online Training. I no longer have the original challenge title and description, but I do remember that it was worth the highest possible points for a challenge and that First Blood was mine :) I had forgotten the solution, but I recently decided to revisit the challenge and work it out again after stumbling upon the mess of files left on my hard drive from my original attempts.
In hindsight, the solution is not difficult and could be quite short, but I found this challenge a good opportunity to showcase the different tools one can turn to for PDF forensics. I’ll therefore take a longer, more exploratory path below. Feel free to skip to the Summary!

Tools Mentioned	Links
010 Editor	https://www.sweetscape.com/010editor/
Didier Stevens’ PDF Tools Masterpost	https://blog.didierstevens.com/programs/pdf-tools/
pdf-parser	http://didierstevens.com/files/software/pdf-parser_V0_7_4.zip
pdfid	http://didierstevens.com/files/software/pdfid_v0_2_7.zip
polyfile	https://pypi.org/project/polyfile/
mutool	https://www.mupdf.com/downloads/index.html
qpdf	https://github.com/qpdf/qpdf
iLovePDF Repair PDF Online	https://www.ilovepdf.com/repair-pdf

Details

We are given VoIP-Research.pdf which throws an error message if opened in standard PDF viewers.

PDF format error message in Foxit PhantomPDF

So how do we check out the contents of this PDF? We can use 010 Editor, which is essentially a hex editor on steroids: it comes with binary templates that parse various file format structures and display them for easy analysis. Conveniently, 010 Editor's default repository already includes a PDF.bt binary template, which can parse our malformed PDF file for us.

Parsed PDF structure in 010 Editor after running template

Running the template shows us that 4 PDF objects were found! Objects 4 and 5 (denoted as 4 0 obj and 5 0 obj) contain streams of length 1, so we can ignore those. Objects 3 and 6 (3 0 obj and 6 0 obj), on the other hand, are much larger compressed FlateDecode streams and hence are of greater interest.
Now if we run Didier Stevens’ pdf-parser, we should be able to list the same objects, decode the streams we need and see what’s in there.

Command:

pdf-parser.py VoIP-Research.pdf

Output:

PDF Comment '%PDF-1.7\n'

PDF Comment '%\xe2\xe3\xcf\xd3\r\n'

obj 3 0
 Type:
 Referencing:
 Contains stream

  <<
    /Filter /FlateDecode
    /Length 4616
  >>


obj 4 0
 Type:
 Referencing:
 Contains stream

  <<
    /Length 1
  >>


obj 5 0
 Type:
 Referencing:
 Contains stream

  <<
    /Length 1
  >>

Wait what? Why do we only get 3 objects - Objects 3, 4, and 5? Where did our previously-seen object 6 go?
For the sake of introducing more tools, let’s also test this out with polyfile. It can be installed with the standard pip3 install polyfile and then run with a flag to generate an interactive HTML file for us to explore the structure of the PDF.

polyfile --html voip.html VoIP-Research.pdf

Parsed PDF structure in generated HTML by polyfile

Interestingly, polyfile also only manages to display 3 of the objects.
To figure out what’s going on, let’s look back deeper into the PDF objects in 010 Editor. Comparing the FlateDecode Object 3 which displays correctly and the missing FlateDecode Object 6, we notice that Object 6 lacks an EndObject and WhiteSpace3 after its Data section. It just goes straight to the PDFTrailer denoted by %%EOF.

Object 6 missing EndObject and WhiteSpace3 when compared to Object 3 in 010 Editor

In fact, Object 6 is also lacking the endstream keyword at the end of its Data stream section. We can see that Objects 3, 4, and 5 have this keyword before endobj. As specified on page 36, section 3.2.7 ‘Stream Objects’ of the PDF Reference, Third Edition, this endstream keyword is necessary along with the starting stream keyword as a matching pair. This is also the reason why Object 6 was not picked up by pdf-parser or polyfile.

endstream keyword present in Objects 3, 4, and 5 in 010 Editor

If we run Didier Stevens’ pdfid on the PDF with pdfid.py VoIP-Research.pdf, we can see that the counts of obj/endobj and stream/endstream keywords do not match.

PDFiD 0.2.7 VoIP-Research.pdf
 PDF Header: %PDF-1.7
 obj                    4
 endobj                 3
 stream                 4
 endstream              3
 xref                   0
 trailer                0
 startxref              0
 /Page                  0
 /Encrypt               0
 /ObjStm                0
 /JS                    0
 /JavaScript            0
 /AA                    0
 /OpenAction            0
 /AcroForm              0
 /JBIG2Decode           0
 /RichMedia             0
 /Launch                0
 /EmbeddedFile          0
 /XFA                   0
 /URI                   0
 /Colors > 2^24         0
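
The same mismatch can be spotted with a few lines of Python. This is only a rough sketch of the idea behind pdfid's first four counters, not a reimplementation of it: the function names and the sample bytes are made up for illustration, and a plain keyword search can produce false positives if the words happen to appear inside binary stream data.

```python
import re

def count_keywords(data: bytes) -> dict:
    """Count structural PDF keywords in raw bytes (illustrative only)."""
    keywords = [b"obj", b"endobj", b"stream", b"endstream"]
    # \b keeps "obj" from matching inside "endobj" and
    # "stream" from matching inside "endstream".
    return {
        kw.decode(): len(re.findall(rb"\b" + kw + rb"\b", data))
        for kw in keywords
    }

def unbalanced_pairs(counts: dict) -> list:
    """Return the opening/closing keyword pairs whose counts differ."""
    pairs = [("obj", "endobj"), ("stream", "endstream")]
    return [(o, c) for o, c in pairs if counts[o] != counts[c]]

if __name__ == "__main__":
    # Hypothetical miniature of the broken file: the last object is never closed.
    sample = (
        b"%PDF-1.7\n"
        b"3 0 obj\n<< /Length 1 >>\nstream\nA\nendstream\nendobj\n"
        b"6 0 obj\n<< /Length 1 >>\nstream\nB\n%%EOF\n"
    )
    counts = count_keywords(sample)
    print(counts)
    print("unbalanced:", unbalanced_pairs(counts))
```

To try it on a real file, read the bytes with open(path, "rb").read() and pass them to count_keywords.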

While we could remedy this by manually adding the missing keywords, let’s introduce another tool that can fix it automatically: mutool. In this case, we will use mutool clean, which happens to repair the broken object. Bear in mind that this will not always work. From mutool clean’s output we can see that it has correctly identified that a PDF object is missing the endobj token. It also tried to fix other problems, such as the missing xref table, though this is not relevant to the solution.
Just for the record, neither qpdf nor https://www.ilovepdf.com/repair-pdf managed to repair this PDF file automatically. You can still keep them in mind when trying to repair other files.

Command:

mutool clean VoIP-Research.pdf output.pdf

Output:

error: cannot find startxref
warning: trying to repair broken xref
warning: repairing PDF document
warning: object missing 'endobj' token
warning: PDF stream Length incorrect
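
For comparison, the manual fix mentioned earlier could be sketched roughly as follows. This is an untested-on-the-challenge-file illustration, and the function name and sample bytes are my own: it assumes the only damage is a missing endstream/endobj pair just before the final %%EOF, and it does not rebuild the xref table or correct the stream's /Length (which mutool also complained about), so a strict parser may still reject the result.

```python
def close_last_object(data: bytes) -> bytes:
    """Insert a missing 'endstream'/'endobj' pair before the final %%EOF.

    Assumption: the last object in the file is the only unterminated one.
    """
    idx = data.rfind(b"%%EOF")
    if idx == -1:
        body, tail = data, b""
    else:
        body, tail = data[:idx], data[idx:]

    # Leave the file untouched if the last object is already closed.
    if body.rstrip().endswith(b"endobj"):
        return data

    return body + b"\nendstream\nendobj\n" + tail

if __name__ == "__main__":
    # Hypothetical miniature of the broken object 6.
    broken = b"6 0 obj\n<< /Length 3 >>\nstream\nabc\n%%EOF\n"
    print(close_last_object(broken).decode())
```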

Running pdfid on output.pdf now shows matching counts for obj/endobj and stream/endstream:

PDFiD 0.2.7 output.pdf
 PDF Header: %PDF-1.7
 obj                    4
 endobj                 4
 stream                 4
 endstream              4
 xref                   1
 trailer                1
 startxref              1
 /Page                  0
 /Encrypt               0
 /ObjStm                0
 /JS                    0
 /JavaScript            0
 /AA                    0
 /OpenAction            0
 /AcroForm              0
 /JBIG2Decode           0
 /RichMedia             0
 /Launch                0
 /EmbeddedFile          0
 /XFA                   0
 /URI                   0
 /Colors > 2^24         0

pdf-parser is now able to detect Object 6 as well!

PDF Comment '%PDF-1.7\n'

PDF Comment '%\xc2\xb5\xc2\xb6\n\n'

obj 3 0
 Type:
 Referencing:
 Contains stream

  <<
    /Filter /FlateDecode
    /Length 4616
  >>


obj 4 0
 Type:
 Referencing:
 Contains stream

  <<
    /Length 1
  >>


obj 5 0
 Type:
 Referencing:
 Contains stream

  <<
    /Length 1
  >>


obj 6 0
 Type:
 Referencing:
 Contains stream

  <<
    /Filter /FlateDecode
    /Length 76765
  >>


xref

trailer
  <<
    /Size 7
  >>

startxref 81636

PDF Comment '%%EOF\n'

With this, we can inflate Object 6 and inspect its contents.

Command:

pdf-parser.py output.pdf -o 6 -f

Output (truncated):

obj 6 0
 Type:
 Referencing:
 Contains stream

  <<
    /Filter /FlateDecode
    /Length 76765
  >>

 'q Q q /Cs1 cs 0.8117647 0.1058824 0.06666667 sc 582.8205 175.4982 m 583.2763\n173.5773 l 583.7512 171.5763 l 583.9353 170.8004 l 583.9353 170.8004 l 584.0098\n170.4865 584.3246 170.2924 584.6385 170.3669 c 584.6385 170.3669 l 584.9524\n170.4414 585.1465 170.7562 585.072 171.0701 c 584.8879 171.846 l 584.413 173.847\n
 ...

Object 6 appears to be some kind of graphics object because of its various graphics operators like q, Q, sc, m, and l. This is detailed on page 134, in table 4.1 ‘Operator Categories’ of the same PDF Reference, Third Edition we used earlier.
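
A quick way to confirm that hunch is to tally the operator tokens in the decoded stream. The sketch below is deliberately naive and the function name is my own: it assumes tokens are whitespace-separated, treats anything that is neither a number nor a /Name as an operator, and would miscount streams containing strings, arrays or inline images.

```python
from collections import Counter

def tally_operators(content: str) -> Counter:
    """Count tokens that look like content-stream operators."""
    ops = Counter()
    for token in content.split():
        if token.startswith("/"):
            continue  # a name operand such as /Cs1
        try:
            float(token)  # numeric operand, e.g. a coordinate
        except ValueError:
            ops[token] += 1  # not a number and not a name: treat as operator
    return ops

if __name__ == "__main__":
    # Start of the decoded stream shown above.
    sample = (
        "q Q q /Cs1 cs 0.8117647 0.1058824 0.06666667 sc "
        "582.8205 175.4982 m 583.2763 173.5773 l"
    )
    print(tally_operators(sample).most_common())
```

A stream dominated by path operators such as m, l and c is most likely vector drawing data.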
To recover this graphics object, we can use mutool create to create a completely new PDF file based on graphics commands of our choice.
First, replace all occurrences of the string \n with actual newline characters in the decoded object stream output from pdf-parser:

Replacing newline characters in Sublime Text Editor

Second, save the replaced object stream contents into a text file such as graphics.txt, then run

mutool create graphics.txt

to create a new PDF with the default name of out.pdf based on the graphic operators and operands in graphics.txt.
We can now open the PDF and view the graphical flag with no problems 😄
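
The find-and-replace step does not need a text editor. A minimal Python equivalent might look like this, assuming the decoded stream text was pasted from pdf-parser's output with its literal backslash-n sequences intact; the function name is made up for illustration.

```python
def unescape_newlines(text: str) -> str:
    """Turn literal backslash-n sequences into real newline characters."""
    return text.replace("\\n", "\n")

if __name__ == "__main__":
    # A fragment in the form pdf-parser printed it (backslash followed by 'n').
    fragment = "582.8205 175.4982 m 583.2763\\n173.5773 l 583.7512 171.5763 l"
    print(unescape_newlines(fragment))
```

Writing the result to graphics.txt gives the same input for mutool create as the editor-based approach.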

Flag: CTFSG{D0GG0_8T3_MY_P4P3R}

Summary

Inspect PDF in 010 Editor with binary template and spot the malformed PDFObj. This research can be augmented with output from pdf-parser and pdfid.
Fix the malformed PDFObj manually or with mutool clean.
Inspect stream contents of PDFObj, decoding with filters if needed, to determine the type of object.
Fix the formatting of the stream contents with a simple find and replace.
Use mutool create to generate a new, clean PDF file from the object contents so that we can view it.

A possible shortcut solution that I have not tested out yet is to just copy out the compressed object stream from 010 Editor and inflate it elsewhere with some Python module. Perhaps you could try this out with any other similar PDF and let me know! I’m not sure if I’m allowed to share this particular challenge PDF file because there may be some sort of challenge creators’ copyright issues.
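
For anyone who wants to try that shortcut, here is one way it might look using Python's standard zlib module. It is a sketch that has not been run against the challenge file, and the function name and sample data are hypothetical. It looks for every stream keyword and attempts to inflate whatever follows; zlib.decompressobj stops at the end of the compressed data, so a missing endstream does not matter, and streams that are not zlib-compressed are simply skipped.

```python
import re
import zlib

def inflate_streams(data: bytes) -> list:
    """Return (offset, inflated_bytes) for each stream that zlib can decode."""
    results = []
    for match in re.finditer(rb"stream\r?\n", data):
        # Skip the tail end of an "endstream" keyword.
        if data[max(0, match.start() - 3):match.start()] == b"end":
            continue
        inflater = zlib.decompressobj()
        try:
            # Bytes after the end of the zlib data land in unused_data
            # rather than raising an error.
            inflated = inflater.decompress(data[match.end():])
        except zlib.error:
            continue  # not a FlateDecode stream (or corrupted)
        results.append((match.start(), inflated))
    return results

if __name__ == "__main__":
    # Hypothetical miniature: one plain stream, one unterminated Flate stream.
    payload = b"q Q q /Cs1 cs 0.8117647 0.1058824 0.06666667 sc"
    sample = (
        b"4 0 obj\n<< /Length 1 >>\nstream\n \nendstream\nendobj\n"
        b"6 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
        + zlib.compress(payload)
        + b"\n%%EOF\n"
    )
    for offset, inflated in inflate_streams(sample):
        print(offset, inflated[:60])
```

Because the inflated bytes already contain real newlines, this route would also make the find-and-replace step unnecessary.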

Thanks for reading! I hope at least one of these tools covered will be useful to you in the future.