## Wednesday, November 20, 2013

### Windows Forensic Live CD

Previously, making a Windows-based Forensic Live CD was not for everyone, mostly due to the amount of tinkering involved. WinXP- and Win7-based Live CDs also have a problem: they write a Windows drive signature to write-protected drives.

The Mini-WinFE project has changed this. Creating a Forensic Live CD with Mini-WinFE takes just a few mouse clicks. Windows 8 and 8.1 also appear not to write a drive signature to the write-protected disk.
In my experience, a Windows 8.1 Enterprise based Live CD has some issues when adding custom programs to it. The Windows 8.1 Pro version works perfectly well.

The boot time is about a minute longer compared to Linux-based Live CDs, but you get the driver and application flexibility of Windows.

TrueCrypt is missing from the default app selection, and I had to spend half an hour fixing that.
Below are scripts to add TrueCrypt 7.1a to the Live CD.

TrueCrypt must be downloaded first and extracted, not installed, on the machine (an installed copy may also work, but I haven't tested it).

TrueCrypt.script must be placed in the \Mini-WinFE\Projects\WinFE\Programs folder.

MD5 (TrueCrypt.script) = 383c5a68888e258f0954c009f813b3ed

MD5 (bblean1.17.1.script) = 75115b21edf70501fe329cb911c80e66

Then just follow the instructions to create your Forensic Live CD and you are done.

## Monday, June 24, 2013

APT, or Advanced Persistent Threat, is usually associated with governments or organised groups of hacktivists.

It is no secret, however, that in the Digital Age organised crime has established its own presence in cyberspace. A conventional street gang's strength depends on its access to disposable foot soldiers willing to take the greatest risks.

Organised crime groups specialising in cybercrime recruit hackers in place of foot soldiers. Countries with high unemployment, low wages, immature legislation and incompetent law enforcement are breeding grounds for them. While these groups can become guns-for-hire and be involved in APT, they normally operate on "the cash must flow" principle. This means they must move on to another target when unsuccessful at hacking one for a 'reasonable' period of time. The effort and persistence usually depend on the complexity of the hack and the potential reward.

Having access to skilful hackers makes these groups capable of launching sophisticated attacks, but these attacks are less persistent than hacktivism-motivated or government-sponsored ones. The problem is that organised cybercrime groups are attracted to potentially high-reward targets. As a result, different, uncoordinated groups of hackers constantly attack these targets, creating an Advanced Cyber Threat Environment (ACTE) for successful businesses and financial institutions.

Smaller and less successful businesses in these countries also operate in an ACT Environment*. In the Wild West era it used to be a Colt; these days a DDoS attack is often invoked by the competition as its most convincing argument. Effective DDoS protection relies on expert knowledge and a good understanding of the company's core business, not on software or hardware, and small businesses don't normally have that capacity.

* I haven't specifically named the countries with an Advanced Cyber Threat Environment; I'll leave that for you to decide.

## Identifying similar documents

MD5 hashing has been used in Computer Forensics for years now. Its primary uses are to ensure the integrity of evidence and to identify identical files. Around 2006, Jesse Kornblum published a paper on a "fuzzy hashing" (officially, "context triggered piecewise hashes") algorithm capable of identifying near-identical files on a binary level. I remember my excitement at the time, and today I still often use this technique when sifting through the clutter.

Thanks to advancements in the Data Mining and Information Retrieval fields, computer forensics professionals now have another tool at their disposal. It is capable of identifying near-duplicates and chained near-duplicates based on the content of a document, not just its binary representation. The algorithm is called w-shingling, or just shingling. Shingling does not attempt to understand the meaning of the text the way some Natural Language Processing algorithms do, so it is language agnostic. It also works on the content of a document regardless of its format (.pdf, .dot, .eml etc.).

The algorithm may have a significant impact on the investigator's workflow because:
a) it is very effective at reducing the number of irrelevant files, and
b) it is great at finding documents similar to files already identified as relevant.

Sometimes we hear about admissibility problems caused by the use of black-box algorithms in eDiscovery, most often associated with another cool technology, "predictive coding". Well, shingling is not part of this legal battle, and the algorithm is not a "black box". In fact, most of us are already using it on a daily basis when searching the Internet. Variations of the algorithm are used by search engines to ensure that the original content is returned in response to our keyword query and the duplicates are omitted.

## Shingling

Shingling is an effective and efficient method for comparing the sets of shingles in files containing text.  As I previously mentioned, shingling doesn't rely on linguistics.

As Albert Einstein once reportedly said, "If you can't explain it to a six year old, you don't understand it yourself." Taking this literally, the algorithm extracts the plain text from a document and performs the following:

a) removes all characters except letters and digits and converts everything to lower case*
b) splits the text into tokens (overlapping groups of words, hence its name “shingling”)
c) compares the sets of shingles generated from the documents to check how similar two documents are

Here is an example of w-shingling where w = 2 words:

Text: "Apples Are Not Oranges."

After step (a) the text becomes "apples are not oranges", and step (b) produces the 2-shingles: {apples are, are not, not oranges}.

The shingle size can vary; however, a w-shingle size of 1 would produce a bag of unordered words, which can be found in many dissimilar documents. So, essentially, by shingling we capture the relationship between the words. Once we have created the shingles, we need a method to compare them.
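The three steps above can be sketched in a few lines of Python (a minimal illustration of the idea, not how any particular forensic tool implements it; the function name `w_shingles` is mine):

```python
import re

def w_shingles(text, w=2):
    # a) keep only letters and digits, lower-case everything
    words = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    # b) build overlapping w-word tokens (the "shingles")
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

a = w_shingles("Apples Are Not Oranges.")   # {'apples are', 'are not', 'not oranges'}
b = w_shingles("Oranges are not apples.")   # {'oranges are', 'are not', 'not apples'}

# c) compare the two shingle sets (Jaccard similarity, covered below)
sim = len(a & b) / float(len(a | b))        # 1 shared shingle out of 5 -> 0.2
```

Note how the two sentences share all four words, yet only one 2-shingle ("are not") survives the word-order change, which is exactly why shingles beat a bag of single words.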

## Jaccard similarity

Paul Jaccard (1868–1944) was a professor of botany and plant physiology. He developed and published the Jaccard index of similarity in 1901, using it to study the distribution of various plants in the alpine zone. It measures similarity between sample sets. Today the method is used in a wide range of disciplines including biology, genealogy, mathematics and computer science, and we can now add computer forensics to the list.

Both baskets A and B have a set of common fruit items, called the intersection (denoted as A ∩ B).


The set of items that appear in either basket A or basket B is called the union (denoted as A ∪ B).
When two sets have one or more elements in common, these elements are counted only once in their union.


The Jaccard similarity index (let's denote it as Sim) is calculated as the number of items in the intersection of A and B divided by the number of items in the union of A and B. The result ranges from 0 (no elements in common) to 1 (identical sets).

$Sim(A,B) = \frac{|A ∩ B|}{|A ∪ B|}$

In the above example we have:

$Sim(A,B) = \frac{|A ∩ B|}{|A ∪ B|} = \frac{3}{10} = 0.3$

## Notes and notations

Objects separated by commas inside parentheses, (A, B, C, D, E), denote an ordered set, i.e. (A, B) ≠ (B, A). If the objects are inside curly brackets, {A, B, C, D, E}, they are unordered, i.e. {A, B} = {B, A}.

In mathematics, a value between | | usually denotes the absolute value of a real number, which is always non-negative, e.g. |−3| = 3. Cardinality, the number of elements in a set, uses the same notation, and we mean the latter in our example: the number of elements in the intersection, |A ∩ B|, is divided by the number of elements in the union, |A ∪ B|.

Here is another example:

Let's consider two sets A and B, where:

A = {0, 1, 2, 3}   and   B = {1, 2, 3, 4, 5, 6, 7}

$Sim(A,B) = \frac{|\{1, 2, 3\}|}{|\{0, 1, 2, 3, 4, 5, 6, 7\}|} = \frac{3}{8} = 0.375$

Python's built-in set type provides union and intersection, so the Jaccard similarity index can be calculated in just a few lines:

```python
A = {0, 1, 2, 3}
B = {1, 2, 3, 4, 5, 6, 7}

IntersectionAB = len(A.intersection(B))
UnionAB = len(A.union(B))
SimAB = IntersectionAB / float(UnionAB)  # 3 / 8 = 0.375
```

The Jaccard distance, i.e. how dissimilar two sets are, can be calculated as:

```python
DistanceAB = (len(A.union(B)) - len(A.intersection(B))) / float(len(A.union(B)))  # 0.625
```

An implementation of Jaccard distance is also available in the Python Natural Language Toolkit (NLTK) and can be used as follows:

```python
from nltk.metrics.distance import jaccard_distance

A = {0, 1, 2, 3}
B = {1, 2, 3, 4, 5, 6, 7}
DistanceAB = jaccard_distance(A, B)  # 0.625
```

Jaccard similarity is a simple yet effective method, but it is not without limitations:
• it doesn't account for the frequency of the query term's occurrence;
• unique words or combinations of words are more informative than commonly used ones;
• longer documents produce more hits than short documents. I have seen a modified formula, $Sim(A,B) = \frac{|A ∩ B|}{\sqrt{|A ∪ B|}}$, being used for length normalisation.

The shingle size should be more than 1 word and should take into account the length of the document and, ideally, the amount of commonly used words in it. It should be large enough to ensure a low probability of an accidental collision in the selected document(s) (a collision here means similarity).

* Other considerations when implementing shingling are capitalisation, white space, punctuation, stop words, and using natural language processing to identify nouns, verbs, synonyms etc.

When dealing with large sets of data, storing the generated shingles becomes a challenge. To reduce their size, a hash function can be used instead of the actual strings. Obviously, we need something more suitable than an MD5 hash; in fact, we want an algorithm that is functionally the opposite of MD5 and closer in outcome to "fuzzy hashing". The min-wise independent permutations algorithm, or MinHash for short, is used for this purpose. I am going to leave MinHash for some other time, as it isn't necessary for understanding the algorithm of finding near-duplicate documents.
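While a proper MinHash discussion can wait, a toy sketch shows the storage-saving idea: replace each document's shingle strings with a short numeric signature whose agreement rate approximates the Jaccard similarity. The salted-hash construction and function names below are my own illustration, not the canonical min-wise permutation scheme or any tool's implementation:

```python
import hashlib
import random

def shingles(text, w=2):
    # normalise: keep letters/digits, lower-case, split into words, build w-word tokens
    cleaned = "".join(c.lower() if c.isalnum() or c.isspace() else " " for c in text)
    words = cleaned.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def minhash_signature(shingle_set, num_hashes=64, seed=42):
    # simulate num_hashes independent hash functions with salted MD5 digests;
    # keep only the minimum hash value per function -> fixed-size signature
    rng = random.Random(seed)
    salts = [str(rng.random()) for _ in range(num_hashes)]
    return [min(int(hashlib.md5((salt + s).encode()).hexdigest(), 16)
                for s in shingle_set)
            for salt in salts]

def estimate_similarity(sig_a, sig_b):
    # the fraction of matching signature positions approximates Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / float(len(sig_a))
```

Whatever the document size, each one is reduced to 64 integers, and identical documents always produce identical signatures (estimated similarity 1.0), while unrelated ones agree in few positions.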

Currently, most commercial computer forensic tools are missing this feature, though I can see some snippets of the functionality being implemented in the latest FTK Lab edition. By expanding its functionality, NUIX 4.2 has become a new player in the Computer Forensics field and has brought a new set of useful tools over from eDiscovery. My favourites so far are:

• solid support for various email formats (NSF, PST, EDB etc.);
• fast and reliable indexing;
• and of course shingling: the ability to export shingle lists for use across cases, similarly to NSRL or MD5 hash sets, and to identify near-duplicates as well as chained near-duplicates.

 Chained near-duplicates
UPDATE (1 October 2014):

We do a lot of cool stuff with this technology at work, in the eDiscovery, Computer Forensics and Incident Response space. With NUIX 6 just released, shingling can now be used for log analysis (evtx, IIS, Apache etc.). It is a game changer in Incident Response... and hey, I now run NUIX on my MacBook Air natively.