Reverse Engineering Index.dat
2003-01-03 : A website visitor asked if I had a tool to extract URLs from index.dat generated by IE. What? Well, no. But I'm sure I could make one if I wanted to. I took a look at index.dat to see what I could figure out. Here are the notes I took as I analyzed the file structure.


Start by dumping it with hexview. Look at the beginning. There is a text string identifying the file type. Further down we see the directory names.

After that there are large chunks of FFs and 00s - they don't tell us much. I've seen them in a lot of files. They are probably indexes or something. Let's skip over that for now.

Scroll until we see interesting data

URLs are easy to pick out, and they occur with chunks of data between them. These are probably records. We just need to figure out how to find the beginning and the end of a record. Looking for things we can recognize, there is a URL, the file name on disk (without path), and then a bunch of http headers. There is a very interesting pattern after the text: "0D F0 AD 0B" or "0BADF00D" when viewed as a dword. I recognize this as a special "garbage" pattern used to fill uninitialized memory by some memory managers. This will make life easier, because we can see what memory is "not used", that is, was never written to by the app once it was returned by the memory manager.

Looking at the end of the strings, they seem to be "00" terminated, which is what you would expect for a C string. After the URL, the file name appears to be aligned on a dword boundary. Confirming this, we see that bits of "0BADF00D" are showing through the cracks between strings.

As the data seems to be variable length, there has to be some sort of info that can be used to walk across the data structure from record to record. Either absolute pointers or relative pointers (or lengths or offsets, etc). It just about has to be relative pointers, since this is in a file.

Things are very well aligned, so this could be a memory mapped file.

Another thing that jumps out is the word "URL" in the middle of the data between records. We also see "REDR". In fact, URL has a space after it, making it a DWORD. If we look further, there is another tag, "HASH" but it too seems to have a trailing space, which would make it an odd size, which is weird. "URL" seems to be a cached file, "REDR" probably stands for redirection. Thus it probably has a link to the URL it redirects to. Also, REDRs are much shorter. The tag seems to come before the data it describes. A common way to structure data is to have records in the form "type, length, data" This is a linked list which can be walked bery quickly. It allows variable sized data, and you don't even have to understand the data to be able to skip past it.

Looking at a couple of small REDR records, the unused memory occupies all the space from the end of the URL text to the next tag. The tags seem to be on 0x80 byte boundaries. I don't see any smaller records. Looking at the DWORD after the tag of these small records, I find "01 00 00 00" consistently. Well, the small records are 1 block of 0x80 bytes. Looking at other records, it is consistent!

Now, we know how to walk across this part of the data structure, and we know the size of the records. We also know that there is no data after the URL text. This makes sense. A common tactic in C is to create a struct to overlay on the beginning of a variable size record. That struct will then have pointers to the beginning of the variable length bits that follow it.

Looking at our anomolous HASH, if we treat as a four byte tag and the next four bytes as an int, we get a record size of 0x20. If we look down 0x80 * 0x20, we find the first URL after the HASH data. Good, that means HASH does follow our rule. I don't see any other tags between the beginning of the file and the HASH tag, just large data sections. So we don't know how to find the HASH yet except by inspection.

Looking at the data in the HASH section, it seems pretty regular. Some columns are all zeros, and some columns are all 00 or 80. We know our tagged records are on 0x80 byte boundaries. We would expect to find pointers to records in the hash table. A hash table should be an array indexed by the hash value so the records within the table should be small, probably just single pointers. I pick an entry beginning with "80" at random, say "80 9F 00 00". These could be offsets from anywhere, but we'll guess the beginning. Checking, at index 0x00009F80 from the beginning of the file, there is a URL tag. This supports the idea that these are offsets to records. However we could have gotten lucky about the offset starting from the beggining of the file. We'd have to see if there were any enties in the table less than the offset to the first URL tag, which is not easy to do by inspection so we'll leave it.

Luckily, I found a second index.dat in another spot. This one only has three entries. However, it is still 32K long. The round size again suggests something memory mapped. Looking at the end, the last record points to the middle of a 0BADF00D region right after the record. That region is two 0x80 blocks long. After that is all zeros. Probably somewhere in the header is information describing the various lengths.

The HASH tag is at a different place in the small file: 000004000 instead of 00005000 in the large file. So, the header is not a fixed size. There must be an offset in the header indicating where the HASH is. Glancing at the file header, the second dword after the text string at the very beginning (at 0x00000020) is "00 40 00 00" or 0x00004000. Checking the other file, the second dword is "00 50 00 00". Bingo! At least, it's pretty likely. We can walk all the tags with no further ado, if we're right. Are there any more "00 40 00 00"'s in the header of the small file? No, not before the directory names. That also supports our hypotheses. It could be one of the other numbers if an additional offset or multiplier was required, but this is probably right.

We also know that the most general or most important or most over-arching data will tend to be at the beginning of a struct. I wonder what the first dword is? A file size or a version number maybe? The small file has 0x8000. The file size is 32K! A likely match. What about the other? The large file has 0x6CC000. The file size is 7,127,040! Right on, we've identified another field.

Looking at the small file, at the HASH data itself, it is mostly "03 00 00 00"s but there are three places with other data, which is a good sign. However, the "smudges" in the table of 3's are two dwords wide, not one. This suprised me, because I thought every dword in the table was an offset. However, paying closer attention to that data in the tables makes it clear that each entry in the hash table is two dwords. the +7 and +15 columns are all zeros, and the +4 and +12 columns are all 00 or 80. The +0 and +3 columns and +8 and +11 columns are clearly NOT all 00 and 80. However, +0 and +8 seem to still have relatively few values - 00 80 40 C0, plus some 5's: 05, 85, 45. Any C5s? Hmm a C4! and an 04 and a C5 finally. Probably flags or offsets to something of regular size. Or both even. Plus, with a 0x3 in every dword slot, the 2nd dword in every pair is clearly not always an offset to a tag! However, the entire low nibble can be used for flags, since we know that the actual offset must be a multiple of 0x80. With these three entries, we can confirm that the offsets are from the beginning of the file though. The offsets are 0x00005000, 5100, and 5200. These exactly match the three URL records. That confirms that the offsets are from the beginning of the file. I can't see any pattern in the first dwords in the three pairs, though. I would expect to see flags or maybe chaining pointers, but the values don't look like offsets of any kind.

There are other two dwords before the beginning of the table. In the small file, they are both zero. They could just be for alignment. The large file has a nonzero value in the first of the two - "00 C0 01 00". I know there is at least one more HASH record in the large file. What's at 0x001C000? The next HASH record! OK, the hashtables are chained together through that dword. The second hash table also has a value in the second dword - "01 00 00 00". Perhaps just a counter indicating that this is the second hash table? Something to investigate later.

The hashtable in the small file has a trailing chunk of zeros. The non-zero portion is 0xE00 bytes long, or 0x1C0 (448) pairs. That's an odd size - 0x40 (64) pairs off the nearest round number of pairs. The other two hash tables are the same. The zeros are probably just padding. 64 must be an interesting unit in the hash algorithm, because adding another 64 pairs would collide with the first URL record. (Why not make the hash 0x21 blocks long and make the table 512 pairs? I don't know. 448 isn't exactly prime.)

Glancing at the header in the small file, I don't see any 3's to match the number of URLs cached. The dword at 0x48 is the number of directories. 0x10 (16) is probably the max, since the huge file has 7 megs of URL data but only 16 directories. The directory names are 8 bytes long (with no terminator), which makes for nice alignment. The dword before the file name seems to have the number of files in that directory.

There's a chunk of zeros after the directory names (0x110 to 0x1C0?). Then a chunk of numbers (0x1C0? to 0x250). Then a bunch of "FF"s. In the small file, there are four "FF"s, a "3F", and then 00s to the HASH tag. The large file has FFs for quite a while, (could this be a bitmap?) then it gets patchy around 0x15C0, and then becomes mostly 00s at 0x19C0 though there are occasional blips. This is a good candidate for a bitmap describing with blocks have records and which are free. Let's confirm this. "FF FF FF FF 3F" is 32+6=38 "on" bits in a row. Times 0x80 is 0x1300. The HASH starts at 0x4000. Thus the first unallocated block should start at 0x5300. Bingo, that is the empty block pointed at by the last URL record. The two "off" bits in the byte that had some "on" bits correspond to the two blocks of 0BADF00D, I bet. I bet that the block manager wrote 0BADF00D into the entire range represented by the byte in the bitmap when it started allocating blocks from that byte. The bytes that have never been touched have the zeros provided by the OS memory manager.

Let's look at the REDR record. There are two unknown dwords before the URL text. It's unlikely there would be a pointer to the URL text since the start is fixed relative to the start of the record. The first dword could be a number since the high byte is consistently zero. Some examples are 0x00005208, 0x0019B0A8, 0x001E211E, 0x00005828. They are probably not direct indices, since they are not multiples of 0x80. The second dword is probably flags, since the examples have a more even bit distribution: 0xDB195C00, 0x3C2B0F80, 0x8CED25C0, 0x05F28E40 No ideas so far.

Let's look at the URL record. We've got 24 dwords to figure out. Well, one seems to be consistently 0BADF00D, so 23. I take that back. #24 is sometimes 0BADF00D (which means that the value was never set) but sometimes it has a value. Let's use offsets from the beginning of the record. So, +0 is the tag "URL ", +4 is the number of 0x80 byte blocks used by this record. +64 is sometimes 0BADF00D.

Let's start with something easy. The offset to the URL text is fixed, but the offset to the file name (which comes next) is not fixed, and we should be able to find it in one of the fields. Looking at an example, the offset of the file name is +A4. The only field with 0xA4 is at +3C. Next is the http headers. The field holding this offset is at +44. There's one dword between those, and it looks like it could be a length. Counting, it doesn't add up.

Where do we stand?
+00 dw Tag
+04 dw Length In Blocks
+08
+0C
+10
+14
+18
+1C
+20
+24
+28
+2C
+30
+34
+38
+3C dw offset to file name
+40
+44 dw offset to http headers
+48
+4C
+50
+54
+58
+5C
+60
+64 dw ??, sometimes uninitialized
+68 sz Url

Looking at what Windows Explorer has to say about the files in the cache, the colums include a "name" which is everything after the last slash, so there might be an index to that, an expires date, and last modified date, a last accessed date, and a last checked date. Last modified and last accessed could be retrieved from the file system, but looking at the strange values for last modified (which is often sent in the http headers), the data must be coming from this file. so we have three to four dates to find. Dates would probably be stored in NT timestamp format, which is a qword (long). Picking one URL at random, I look it up in explorer. Looking at the data, I only see two longs that look like timestamps - +8 and +10. Lets look at +8: "00 14 AC 72 BE 95 C2 01" or 0x01C295BE72AC1400. We'll ask w32tm to translate: "w32tm /ntte 0x01C295BE72AC1400" returns "146792 02:41:12.0000000 - 11/26/2002 8:41:12 PM (local time)". That matches with "last modified" in explorer. Let's try the other one. "10 BA 4D B3 AE 99 C2 01" or 0x01C299AEB34DBA10 is "12/1/2002 8:58:33 PM", which matches "last accessed".

Looking at the file size of 2K, I don't see any dwords with a value near 0x0800, so the file size is probably not stored in the record. I take that back. 0x676 is probably close enough to read as a size. It seems to match pretty well. NT can support file sizes in the qword range, but I'm going to be conservative and guess that this field is only a dword wide.

The dword at +18 seems to be zero when the expires field is none, so they are probably related, but the value in the filed doesn't look like a time. Maybe it's in seconds?

BTW, since the beginning of the URL record is a timestamp, maybe the beginning of the REDR record is also a timestamp? No, the qwords at +8 in the REDR records menthoned above do not look like timestamps - The high bits should be the same (0x01C...) but they definitely are not.

I would expect the cache directory index to be one of the fields as well. I wonder which it is?

Here's where we stand now:
+00 dw Tag
+04 dw Length In Blocks
+08 qw Last modified
+10 qw Last accessed
+18 ?w expires
+1C
+20 dw? size
+24
+28
+2C
+30
+34
+38
+3C dw offset to file name
+40
+44 dw offset to http headers
+48
+4C
+50
+54
+58
+5C
+60
+64 dw ??, sometimes uninitialized
+68 sz Url


2003-01-04 : Thinking about this last night, I realized you couldn't just use the block lengths to walk the whole list of blocks, because there are discontinuities. You would need to use the bitmap at the beginning of the file to know whether the next block was valid or contained garbage.


2003-04-05 : Jeff Newman directed me to the page by Wolfgang Baudisch describing the IE4 cache layout for the answer to my last question. Baudisch identifies the field containing the cache directory index as being at offset +38 for IE5. It is a one byte field, which helps explain why I didn't find it earlier. :)

For the record, the files I am analyzing here are Client UrlCache MMF Ver 5.2.


2003-09-14 : Perhaps a more specific example is in order. Here is a hex dump from the middle of one of the index files on my hard drive. Mouse over the description on the right to highlight the corresponding bytes on the left.

00007280: 55 52 4C 20 03 00 00 00 00 23 C6 9D B4 97 C1 01 | URL .....#......
00007290: 30 A3 07 9B 48 63 C3 01 00 00 00 00 00 00 00 00 | 0...Hc..........
000072A0: 78 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | x...............
000072B0: 60 00 00 00 68 00 00 00 01 00 10 10 8C 00 00 00 | `...h...........
000072C0: 41 00 00 00 98 00 00 00 CE 00 00 00 00 00 00 00 | A...............
000072D0: 0F 2F 14 82 01 00 00 00 00 00 00 00 0F 2F 14 82 | ./.........../..
000072E0: 00 00 00 00 0D F0 AD 0B 68 74 74 70 3A 2F 2F 77 | ........http://w
000072F0: 77 77 2E 6D 73 6E 62 63 2E 63 6F 6D 2F 6D 2F 6A | ww.msnbc.com/m/j
00007300: 73 2F 6D 61 72 71 2E 6A 73 00 AD 0B 6D 61 72 71 | s/marq.js...marq
00007310: 5B 31 5D 2E 6A 73 00 0B 48 54 54 50 2F 31 2E 31 | [1].js..HTTP/1.1
00007320: 20 32 30 30 20 4F 4B 0D 0A 43 6F 6E 74 65 6E 74 |  200 OK..Content
00007330: 2D 4C 65 6E 67 74 68 3A 20 31 31 34 34 0D 0A 43 | -Length: 1144..C
00007340: 6F 6E 74 65 6E 74 2D 54 79 70 65 3A 20 61 70 70 | ontent-Type: app
00007350: 6C 69 63 61 74 69 6F 6E 2F 78 2D 6A 61 76 61 73 | lication/x-javas
00007360: 63 72 69 70 74 0D 0A 45 54 61 67 3A 20 22 34 30 | cript..ETag: "40
00007370: 32 36 66 63 39 64 62 34 39 37 63 31 31 3A 35 30 | 26fc9db497c11:50
00007380: 36 22 0D 0A 58 2D 50 6F 77 65 72 65 64 2D 42 79 | 6"..X-Powered-By
00007390: 3A 20 41 53 50 2E 4E 45 54 0D 0A 50 33 50 3A 20 | : ASP.NET..P3P: 
000073A0: 43 50 3D 22 42 55 53 20 43 55 52 20 43 4F 4E 6F | CP="BUS CUR CONo
000073B0: 20 46 49 4E 20 49 56 44 6F 20 4F 4E 4C 20 4F 55 |  FIN IVDo ONL OU
000073C0: 52 20 50 48 59 20 53 41 4D 6F 20 54 45 4C 6F 22 | R PHY SAMo TELo"
000073D0: 0D 0A 0D 0A 7E 55 3A 6C 6F 75 69 73 20 74 68 6F | ....~U:louis tho
000073E0: 6D 61 73 0D 0A 00 AD 0B 0D F0 AD 0B 0D F0 AD 0B | mas.............
000073F0: 0D F0 AD 0B 0D F0 AD 0B 0D F0 AD 0B 0D F0 AD 0B | ................
+00 dw Tag
+04 dw Length In Blocks
+08 qw Last modified
+10 qw Last accessed
+18 ?w expires
+1C
+20 dw? size
+24
+28
+2C
+30
+34
+38 b cache directory index
+3C dw offset to file name = 0x8C
+40
+44 dw offset to http headers = 0x98
+48
+4C
+50
+54
+58
+5C
+60
+64 dw ??, sometimes uninitialized
+68 sz Url
+8C sz File Name (offset not fixed)
+98 sz Http headers (offset not fixed)
+178 sz Padding (offset not fixed)
C o m m e n t s :     updated: 2014-06-13 (76 days ago)
Edit