
Monday, February 17, 2014

Forensics Quickie: PDF Metadata Forensics (Sunday Funday Answer)

FORENSICS QUICKIES! These posts will consist of small tidbits of useful information that can be explained very succinctly.

David Cowen's Sunday Funday contests are great. In short, a question is posed about a subject, and readers get the opportunity to answer and explain how it is relevant in a forensic investigation. Much can be learned from these contests, as a great answer is almost always delivered and posted publicly. I highly recommend checking it out and submitting answers (it's a weekly thing, so the subjects vary widely).

This week's (2/16/14) subject was PDF Metadata. I took a stab at answering and wound up winning, so I figured it wouldn't hurt to post my answer here. The original answer, along with a preface by David Cowen, can be found here.

The Questions
The subject's prompt was as follows:

1. What metadata can be present within a PDF document?
2. What affects the types of metadata that will be present within a PDF document?

The Answer
The following is my answer to the prompt. Disclaimer: this was written in the wee hours of one night, so if there are any typos or errors, please feel free to contact me.

Much can be gleaned from PDF metadata. Of course, there are the standard fields that will provide you with relatively common metadata...but depending on the program you use to create the PDF, there could be much, much more.

Let's start with the most common metadata. Most PDFs will have embedded metadata showing the PDF version, creation date, creation program, document ID, and last modified date. There are definitely more, but I have left them out as they wouldn't be very useful in an investigation (e.g. page count). The use for the aforementioned metadata is fairly obvious, but I will explain nonetheless.

The more obvious PDF metadata entries are:
  • Creation program: program used to create the PDF (was it through desktop software, was it scanned, etc.)
  • Creator/Author/Producer: Username or full name of the PDF's author OR further details on the program used to create the PDF (is it a previous employer?)
  • Title: the title of the PDF that usually provides an outdated name for the document; good for identifying previous employer documents or documents that have been converted from one format into a PDF (e.g. SecretBluePrint.eps or oldCompanyFinances.doc shows up in the 'title' metadata entry)
Those are the easy ones. But what about the more overlooked metadata? As I mentioned before, the program used to create or modify the PDF may have a huge impact on what information you are given. With that, let's look into it.

First, timestamps. We know that file metadata could potentially serve as a better indicator of when a given document was created. If the PDF has been transferred across various volumes and systems -- and we would like to find the origin of the document -- the creation date in the file metadata is going to be more reliable than the file system creation date (as the filesystem date/time will have been updated with the copies/moves).
A More Reliable Creation Date

The metadata 'creation' date will [usually] preserve the REAL date of the file's creation. That is, if the PDF has been transferred across various volumes and systems, the 'creation' date in the file's metadata is going to give us a better idea of when the document was initially created.

The 'modified' date can be used in a similar way. We might even be able to tell how many programs the PDF was modified or saved through. Say we have a PDF created using Adobe InDesign. If we were to open this PDF, modify it, and then save it as a new file using 'Save As...' in a program like Adobe Acrobat, we would see that the 'creation' date is still unchanged, but the 'modified' date has been updated (file system creation dates will tell us differently). Pretty standard stuff. Even if the PDF is saved using 'Save As...' (essentially creating a new file altogether with an updated file system creation time) AND it is moved from one system to the next, we will still have a genuine 'creation' date. Not only that, but we will have a metadata 'modified' date AND a new file system creation time to work with. Correlation among file metadata and file system timestamps is beyond the scope of this answer, but you get the point; 'creation' and 'modified' metadata dates are powerful and can be used creatively.

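Also, with many PDF timestamps, we will be able to see a timezone offset. For example, a creation timestamp could be 2013:02:22 11:21:34-06:00. We now have a potential indication that the system that produced this PDF was set to a UTC-06:00 time zone, which in February (when daylight saving time is not in effect) corresponds to U.S. Central time rather than Mountain time.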
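To make the offset point concrete, here is a minimal Python sketch that parses a timestamp in the format shown above and pulls out its UTC offset. The function name parse_exif_timestamp is my own; the only assumption is that the timestamp string looks like the exiftool-style value quoted in this post.

```python
from datetime import datetime, timedelta, timezone

# exiftool-style PDF dates: "YYYY:MM:DD HH:MM:SS" followed by a UTC offset.
EXIF_FORMAT = "%Y:%m:%d %H:%M:%S%z"


def parse_exif_timestamp(value):
    """Return a timezone-aware datetime for an exiftool-style timestamp."""
    return datetime.strptime(value, EXIF_FORMAT)


created = parse_exif_timestamp("2013:02:22 11:21:34-06:00")

# The offset recorded by the producing system (six hours behind UTC here).
offset = created.utcoffset()

# Normalizing to UTC lets us compare timestamps written with different offsets.
created_utc = created.astimezone(timezone.utc)

print(offset == timedelta(hours=-6))
print(created_utc.isoformat())
```

Keep in mind that an offset is not a time zone: several zones share UTC-06:00 at different times of the year, so treat it as a hint and not as proof of location.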

I mentioned that we might be able to determine if a PDF was created and modified through more than one program. As a quick side note, and if we really wanted to dive into the PDF analysis, we could take a look at some of the other telling metadata. The example above suggested creating a PDF in InDesign, opening it up in Acrobat, modifying it, and then saving it as a new file. When this happens, some of the metadata in the new file (like the 'modified' time) is updated while all of the InDesign metadata stays intact. However, there is a significant difference this time around: the 'XMP Toolkit' metadata value is different. Adobe implements their XMP Toolkit in all of their applications and plugins. They even open sourced it, so other programs can use it (and many do). The point is, "the XMPCore component of this toolkit is what allows for the creation and modification of the metadata that follows the XMP Data Model" (more here and here). So we have two PDFs, but the metadata for each was manipulated by two different versions of Adobe's XMP Core.

InDesign used:
"Adobe XMP Core 5.3-c011 66.145661, 2012/02/06-14:56:27" and...

Acrobat used:
"Adobe XMP Core 5.4-c005 78.147326, 2012/08/23-13:03:03"

But why is this important? Well, we can now more accurately pinpoint the program used to create the PDF. Sure, we will likely already have a metadata entry that tells us the 'Creation Program,' but consider the above example; that tool (InDesign) may have been used to initially create the PDF, but it was NOT used to open, modify, and save a new version of it (Acrobat did that). Let's keep this in mind as we explain some other interesting metadata...
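If you have a pile of PDFs to compare, the XMP Toolkit strings can be broken apart and sorted. Below is a rough sketch that parses the two strings quoted above. The regular expression is my own guess at the layout based only on those two examples, so other toolkit strings may not match it.

```python
import re
from datetime import datetime

# Layout assumed from the two example strings in this post:
# "Adobe XMP Core <version>-<build> <revision>, <build date>"
XMPTK_PATTERN = re.compile(
    r"Adobe XMP Core (?P<version>\d+\.\d+)-(?P<build>\S+) "
    r"(?P<revision>[\d.]+), (?P<built>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2})"
)


def parse_xmp_toolkit(value):
    """Split an XMP Toolkit string into comparable parts, or return None."""
    match = XMPTK_PATTERN.search(value)
    if match is None:
        return None
    return {
        "version": tuple(int(part) for part in match["version"].split(".")),
        "build": match["build"],
        "built": datetime.strptime(match["built"], "%Y/%m/%d-%H:%M:%S"),
    }


indesign = parse_xmp_toolkit("Adobe XMP Core 5.3-c011 66.145661, 2012/02/06-14:56:27")
acrobat = parse_xmp_toolkit("Adobe XMP Core 5.4-c005 78.147326, 2012/08/23-13:03:03")

# A newer toolkit than the creating program shipped with suggests a later editor.
print(acrobat["version"] > indesign["version"])
print(acrobat["built"] > indesign["built"])
```

Comparing both the version tuple and the build date gives two independent ways to order the toolkits.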

Remember: the amount of metadata that a program uses when creating files is limitless. XMP is built on XML, so any metadata tags can be defined. Let's take a real-world example of how powerful PDF metadata can be when created from certain programs. Download Trustwave's Global Security Report PDF from 2013. Run it through exiftool. What do you see? That's right, the "History" metadata fields will show you not only that the document was saved 497 times, but also the exact times that it was saved, the program used to save it each time, and the Document Instance ID for each save (less exciting).
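For anyone who would rather script this than scroll through 497 saves, here is a hedged sketch. It assumes exiftool is on your PATH and that its JSON output (the -j option) reports the History fields as parallel lists named HistoryAction, HistoryWhen, HistorySoftwareAgent, and HistoryInstanceID. That is how I understand exiftool to flatten XMP history, but verify it against your own output. The sample_tags dictionary is a made-up, two-entry stand-in (borrowing two timestamps from this post), not the real report's metadata.

```python
import json
import subprocess
from itertools import zip_longest

# Tag names assumed from exiftool's flattened XMP history output.
HISTORY_FIELDS = ("HistoryAction", "HistoryWhen",
                  "HistorySoftwareAgent", "HistoryInstanceID")


def run_exiftool(path):
    """Run exiftool (assumed to be installed) and return the tags of one file."""
    result = subprocess.run(["exiftool", "-j", path],
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)[0]


def as_list(value):
    """exiftool emits a bare value when a list has a single item."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def history_entries(tags):
    """Pair up the parallel History lists into one dict per save.

    If the lists differ in length, the pairing may be misaligned;
    missing fields are padded with None so that this is visible.
    """
    columns = [as_list(tags.get(field)) for field in HISTORY_FIELDS]
    return [dict(zip(HISTORY_FIELDS, row)) for row in zip_longest(*columns)]


# Hypothetical sample, shaped like exiftool JSON output.
sample_tags = {
    "HistoryAction": ["created", "saved"],
    "HistoryWhen": ["2012:12:29 11:20:49-06:00", "2013:02:22 11:18:06-06:00"],
    "HistorySoftwareAgent": ["Adobe InDesign 7.5", "Adobe InDesign 7.5"],
    "HistoryInstanceID": ["xmp.iid:EXAMPLE-0001", "xmp.iid:EXAMPLE-0002"],
}

entries = history_entries(sample_tags)
print(len(entries), "history entries")
print("last save:", entries[-1]["HistoryWhen"])
```

Against a real file you would call history_entries(run_exiftool("report.pdf")), where report.pdf is whatever you named the download.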

While you have that open, take a look at the creation date (2013:02:22 11:21:34-06:00) and modify date (2013:05:09 10:47:39-07:00). The modify date is much later, but the last "History" save on the file was 2013:02:22 11:18:06-06:00. What's up with that? This is because the PDF was modified in a different program; one newer than InDesign CS5.5. How do I know this? Well, look at the XMP Core version. The XMP Core version recorded in this PDF, which was supposedly produced by InDesign CS5.5, is "Adobe XMP Core 5.4-c005 78.147326, 2012/08/23-13:03:03." I just so happen to have a PDF created with InDesign CS6 and that PDF uses "Adobe XMP Core 5.3-c011 66.145661, 2012/02/06-14:56:27." How can it be that CS5.5 is using a later XMP Core version than CS6?! Because another program was used to modify the CS5.5 PDF after the last save. On 2013:05:09 10:47:39-07:00 (the modify date), some program (let's just say it's Acrobat to satisfy my example from before) modified the PDF. The XMP Core version shown in the metadata is NOT from CS5.5.

Also from the 'History' metadata, we can tell that the creation date is actually "2012:12:29 11:20:49-06:00." and NOT "2013:02:22 11:21:34-06:00." My guess is that InDesign was keeping track of the saves, but when it came down to exporting the PDF, it tacked on the export date as the "Create Date" (as the last 'History' save of the file is 3 minutes before the alleged "Create Date").
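The arithmetic behind those observations is easy to check. The sketch below uses only the four timestamps quoted in this post; the ten-minute EXPORT_WINDOW is an arbitrary threshold I picked for illustration, not a documented InDesign behavior.

```python
from datetime import datetime, timedelta


def parse(value):
    """Parse an exiftool-style timestamp with a UTC offset."""
    return datetime.strptime(value, "%Y:%m:%d %H:%M:%S%z")


first_save = parse("2012:12:29 11:20:49-06:00")   # earliest 'History' entry
last_save = parse("2013:02:22 11:18:06-06:00")    # latest 'History' entry
create_date = parse("2013:02:22 11:21:34-06:00")  # metadata "Create Date"
modify_date = parse("2013:05:09 10:47:39-07:00")  # metadata "Modify Date"

# Time between the last logged save and the stated creation (likely the export).
export_gap = create_date - last_save

# Time between the last logged save and the final modification, which the
# 'History' entries do not account for.
unlogged_gap = modify_date - last_save

# Assumed threshold: a "Create Date" shortly after the last save looks like an export.
EXPORT_WINDOW = timedelta(minutes=10)
looks_like_export = timedelta(0) <= export_gap <= EXPORT_WINDOW

print("export gap:", export_gap)
print("unlogged gap:", unlogged_gap)
print("document existed before Create Date:", first_save < create_date)
```

Because the values carry their offsets, the subtraction works correctly even though the modify date was written with -07:00 and the others with -06:00.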

If we really wanted to, we could use another metadata field (the PDF version) to further pinpoint the program used. If the PDF version is 1.7, we could look for programs on a suspect computer that save PDFs to version 1.7 by default. Believe it or not, many programs still save PDFs as version 1.4, 1.5, and 1.6.
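Pulling the PDF version does not even require a metadata tool, since it sits in the file header. Here is a small sketch that reads it from the first bytes of a file. The function names are mine, and note the caveat that a PDF's document catalog can carry a Version entry that overrides the header, which this sketch does not look at.

```python
import re

# A PDF starts with a header such as "%PDF-1.7".
HEADER_PATTERN = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data):
    """Return (major, minor) from the header in the first 1024 bytes, or None."""
    match = HEADER_PATTERN.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pdf_header_version_from_file(path):
    """Convenience wrapper that reads only the start of the file."""
    with open(path, "rb") as handle:
        return pdf_header_version(handle.read(1024))


print(pdf_header_version(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj"))
print(pdf_header_version(b"not a pdf at all"))
```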

After all of this, I think it's safe to say that PDF metadata can be pretty valuable. You just need to know what's available to you and how to interpret it.


Thursday, February 6, 2014

Forensics Quickie: Pinpointing Recent File Activity - RecentDocs

FORENSICS QUICKIES! These posts will consist of small tidbits of useful information that can be explained very succinctly.

[UPDATE #02 06/09/2016]: Juan Aranda (@jharanda1) wrote a more portable Python script to automate the process detailed in this post. It uses Willi Ballenthin's (@williballenthin) python-registry. The script by Eric (below) used a module that could only be run on Windows (and left some clutter when run against an NTUSER hive). Being able to run Juan's script on OS X, Windows, and Linux is helpful. I really like the output format and that it doesn't leave any extra clutter after parsing, so this is my go-to script for this kind of analysis now. You can find more info about the script at Juan's blog (RaptIR) here.

Note that Phill Moore (@phillmoore) also has a RegRipper plugin built for this technique, as well. There are many options out there, so test which one you like best. You can find Phill's plugin here.

[UPDATE #01 05/01/2015]: A Python script has been created to automate the process detailed in this post. You can download Eric Opdyke's (@EricOpdyke) script here.

Let's revisit the basics...

A suspicious employee left your company on January 28, 2014. You'd like to know which files were most recently used (opened, saved) on the employee's system right before he/she left.

The Solution
Pull the user's NTUSER.DAT. Run RegRipper to easily output the values within the Software\Microsoft\Windows\CurrentVersion\Explorer\RecentDocs subkey.

To give some background, the RecentDocs subkey will contain a few things:
  • Subkeys named after every file extension used on the system (e.g. a subkey for .zip, .doc, .mp4, etc.). 
    • Each of these subkeys will contain its own MRUListEx value that keeps track of the order in which files were opened. (Sound familiar? We also saw use of the MRUListEx value upon analyzing shellbags artifacts). 
    • Each subkey will store up to 10 numbered values; each numbered value represents a recently opened file with the extension found in the subkey's name.

    The .gif subkey showing the 10 most recently opened .gif files for a specific user. The order is stored in MRUListEx.

  • A subkey for folders most recently opened.
    • Can store up to 30 numbered values (i.e. the 30 most recently opened folders).
  • A root MRUListEx value
    • Stores the 149 most recently opened documents and folders across all file extensions.
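If you ever need to decode these values by hand, here is a minimal Python sketch. It assumes the commonly described layouts: MRUListEx is a run of 4-byte little-endian integers ending with 0xFFFFFFFF, and each numbered value begins with a null-terminated UTF-16LE file name followed by other data. Both layouts match my understanding of the artifact, but check them against your own hive. The function names are made up for this example.

```python
import struct

MRU_TERMINATOR = 0xFFFFFFFF


def parse_mrulistex(data):
    """Return the numbered-value names in MRU order (most recent first)."""
    order = []
    for offset in range(0, len(data) - 3, 4):
        (index,) = struct.unpack_from("<I", data, offset)
        if index == MRU_TERMINATOR:
            break
        order.append(index)
    return order


def recentdocs_filename(data):
    """Return the leading null-terminated UTF-16LE name of a numbered value."""
    for offset in range(0, len(data) - 1, 2):
        if data[offset:offset + 2] == b"\x00\x00":
            return data[:offset].decode("utf-16-le")
    return data.decode("utf-16-le", errors="replace")


# Synthetic bytes built for this example, not pulled from a real hive.
raw_mru = struct.pack("<5I", 4, 2, 3, 1, MRU_TERMINATOR)
raw_value = "strange_bird.jpg".encode("utf-16-le") + b"\x00\x00" + b"\x01\x02"

print(parse_mrulistex(raw_mru))
print(recentdocs_filename(raw_value))
```

In practice a library such as python-registry (used by the scripts mentioned in the updates above) would hand you these raw bytes along with each key's LastWrite time.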

With that, take note that MRU times are powerful; here's why:

Since every extension has its own subkey, we will have quite a few registry key LastWrite times with which to work. But so what? When dealing with RecentDocs, the LastWrite time only applies to the most recently opened file within that subkey. While this is certainly the case, we mustn't forget that we also have an all-encompassing MRUListEx in the root of the RecentDocs subkey.

So what does this mean? By using the LastWrite times of each subkey and the root MRUListEx list, we can more accurately determine when certain files were last opened. While we won't be able to get an EXACT time for non-MRU files (i.e. files that are not the first entry in the MRUListEx value), we may be able to determine a time range in which these files were opened. Consider the following example:

Our root MRUListEx shows:

**All values printed in MRUList\MRUListEx order.
LastWrite Time Thu Feb  6 14:08:49 2014 (UTC)

  26 = smiley.gif
  99 = strange_bird.jpg
  42 = books_to_read.csv
  130 = secret_clients.csv
  55 = zip_passwords.rtf
  87 = my_diet.csv
  3 = contacts.csv
  74 = phillip.csv
  72 = notes.txt


At this point, if we look at the first few entries of the root MRUListEx, we can really only tell that smiley.gif was last opened on Feb 6, 2014. So let's mark it:

  26 = smiley.gif ............................Feb  6 14:08:49 2014
  99 = strange_bird.jpg
  42 = books_to_read.csv
  130 = secret_clients.csv
  55 = zip_passwords.rtf
  87 = my_diet.csv
  3 = contacts.csv
  74 = phillip.csv
  72 = notes.txt

That secret_clients.csv looks pretty interesting. Let's look at the .csv subkey to see if we can find out when it was opened.

The .csv subkey's MRUListEx shows:

LastWrite Time Sat Jan 18 19:19:22 2014 (UTC)
MRUListEx = 6,2,1,7,3,5,4

  6 = books_to_read.csv  
  2 = secret_clients.csv
  1 = my_diet.csv
  7 = contacts.csv
  3 = phillip.csv

Dang. Looks like the books_to_read.csv file is the most recently opened .csv, which means that the Jan 18 19:19:22 2014 LastWrite time of the .csv subkey is going to represent the last time that that file was opened -- not secret_clients.csv. I guess we can give up on this one...

...or...we could look at the surrounding files to glean some more information. First, let's mark the known LastWrite time in our root MRUListEx list.

  26 = smiley.gif ............................Feb  6 14:08:49 2014
  99 = strange_bird.jpg
  42 = books_to_read.csv...............Jan 18 19:19:22 2014
  130 = secret_clients.csv
  55 = zip_passwords.rtf
  87 = my_diet.csv
  3 = contacts.csv
  74 = phillip.csv
  72 = notes.txt

Looking at our list, we can see that there are two open times that we know for sure. We can also see that there is another file that was opened between these times. Since the list is ordered from most recent to least recent, it looks like strange_bird.jpg was opened before smiley.gif, but after books_to_read.csv. Since this list is an all-encompassing list, it would be safe to say that between January 18 and Feb 6, strange_bird.jpg was opened. But let's go ahead and confirm that.

The .jpg subkey's MRUListEx shows:

LastWrite Time Sat Jan 18 21:54:57 2014 (UTC)
MRUListEx = 4,2,3,1

  4 = strange_bird.jpg 
  2 = stone_tanooki.jpg
  3 = camel.jpg
  1 = fighting_frogs.jpg

It looks like strange_bird.jpg is the most recently opened .jpg on the system. That means that the Jan 18 21:54:57 2014 LastWrite time of the .jpg subkey represents the time at which strange_bird.jpg was opened. Again, let's mark it in our root MRUListEx list:

  26 = smiley.gif ............................Feb  6 14:08:49 2014
  99 = strange_bird.jpg...................Jan 18 21:54:57 2014
  42 = books_to_read.csv...............Jan 18 19:19:22 2014
  130 = secret_clients.csv
  55 = zip_passwords.rtf
  87 = my_diet.csv
  3 = contacts.csv
  74 = phillip.csv
  72 = notes.txt

At this point, we still don't know when secret_clients.csv was last opened. Upon finding its LNK file, we note that it had been first opened in late 2013 (LNK creation time). If this machine were running XP, we would be able to grab the access time of this LNK file to see when it *might* have been last opened. Alas, we are using Windows 7 (disclaimer: there are tons of other things you should be looking at anyway; this is merely one example of why we would want to look at RecentDocs). Let's keep moving.

Taking a second look at our root MRUListEx list, we notice that the next item after secret_clients.csv is zip_passwords.rtf. That also looks interesting, so let's see when that was opened.

The .rtf subkey's MRUListEx shows:

LastWrite Time Fri Jan 17 14:15:31 2014 (UTC)
MRUListEx = 2,1

  2 = zip_passwords.rtf 
  1 = fiction_novel.rtf

It looks to be the first one, so we can use the LastWrite time to mark it in our list.

  26 = smiley.gif ............................Feb  6 14:08:49 2014
  99 = strange_bird.jpg...................Jan 18 21:54:57 2014
  42 = books_to_read.csv...............Jan 18 19:19:22 2014
  130 = secret_clients.csv
  55 = zip_passwords.rtf.................Jan 17 14:15:31 2014
  87 = my_diet.csv
  3 = contacts.csv
  74 = phillip.csv
  72 = notes.txt

With that, we have found a pretty tight time frame as to when secret_clients.csv was last opened. We don't have the exact date and time, but we can now say that it was last opened somewhere between Jan 17 14:15:31 and Jan 18 19:19:22 (UTC).

It looks as though there are some more .csv files that were recently opened, and by the looks of it, that .txt file might be able to help us home in on a time frame.

The .txt subkey's MRUListEx shows:

LastWrite Time Fri Jan 10 18:23:08 2014 (UTC)
MRUListEx = 4,5,2,1,3

  4 = notes.txt 
  5 = groceries.txt
  2 = how_to_make_toast.txt
  1 = my_novel.txt
  3 = address.txt

Just as before, let's mark the LastWrite time in our root MRUListEx list.

  26 = smiley.gif ............................Feb  6 14:08:49 2014
  99 = strange_bird.jpg...................Jan 18 21:54:57 2014
  42 = books_to_read.csv...............Jan 18 19:19:22 2014
  130 = secret_clients.csv
  55 = zip_passwords.rtf.................Jan 17 14:15:31 2014
  87 = my_diet.csv
  3 = contacts.csv
  74 = phillip.csv
  72 = notes.txt...............................Jan 10 18:23:08 2014

Again, we don't have an exact date and time as to when the my_diet.csv, contacts.csv, and phillip.csv files were last opened. But, based on the LastWrite times of the .rtf and .txt subkeys, we do know the time frame in which these files were opened: between Jan 10 18:23:08 and Jan 17 14:15:31 (UTC).
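The bracketing we just did by hand is mechanical enough to sketch in Python. The function below is my own illustration (not one of the scripts linked at the top of this post). It assumes you have already built the root list in MRU order and collected the LastWrite times for the files that sit first in their extension subkeys.

```python
from datetime import datetime


def bracket_open_times(root_order, known_times):
    """Bound the last-open time of each file that has no exact timestamp.

    root_order:  file names from the root MRUListEx, most recent first.
    known_times: name -> datetime for files with a known open time.
    Returns name -> (earliest, latest); a bound is None if nothing brackets it.
    """
    brackets = {}
    for position, name in enumerate(root_order):
        if name in known_times:
            continue
        # Nearest dated entry above (opened more recently) caps the range.
        latest = next((known_times[n] for n in reversed(root_order[:position])
                       if n in known_times), None)
        # Nearest dated entry below (opened earlier) starts the range.
        earliest = next((known_times[n] for n in root_order[position + 1:]
                         if n in known_times), None)
        brackets[name] = (earliest, latest)
    return brackets


root_order = ["smiley.gif", "strange_bird.jpg", "books_to_read.csv",
              "secret_clients.csv", "zip_passwords.rtf", "my_diet.csv",
              "contacts.csv", "phillip.csv", "notes.txt"]

known_times = {
    "smiley.gif": datetime(2014, 2, 6, 14, 8, 49),
    "strange_bird.jpg": datetime(2014, 1, 18, 21, 54, 57),
    "books_to_read.csv": datetime(2014, 1, 18, 19, 19, 22),
    "zip_passwords.rtf": datetime(2014, 1, 17, 14, 15, 31),
    "notes.txt": datetime(2014, 1, 10, 18, 23, 8),
}

for name, (earliest, latest) in bracket_open_times(root_order, known_times).items():
    print(f"{name}: between {earliest} and {latest}")
```

The more extension subkeys a user has touched, the more dated anchors you get, and the tighter these ranges become.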

Now imagine that a USB device was also plugged in at some point within this time frame. It is possible that the suspicious employee may have done a "Save As..." on contacts.csv and saved it to the external device. It is also possible that the employee opened the file and copy/pasted the information into an email. Of course, the possibilities are endless, but you get the point.

And it goes without saying that other artifacts, such as LNK files, jump lists, ComDlg32, OfficeDocs, and more should also be used to complement RecentDocs analysis. In particular, ComDlg32 would be of interest in this scenario, as it will show you files recently opened/saved. It offers more information (e.g. the executable used to open the file, the file's path, etc.) and stores saves/opens by extension, as well. As such, we can use this same technique to pinpoint file open/save times (though, it is a bit limited, as the * extension subkey will only show up to 20 entries).

This post quickly became longer than expected, but it's still a quickie. I'll end with this:

This is nothing new. All of this data is available to you in the Windows registry. This is a fairly basic technique, but it is also one that shouldn't be overlooked. And as always, use more than one artifact to make your case.
