Identifying similar documents
The MD5 hashing technique has been used in computer forensics for years now. Its primary uses are to ensure the integrity of evidence and to identify identical files. Around 2006, Jesse Kornblum published a paper on "fuzzy hashing" (officially, "context triggered piecewise hashes"), an algorithm capable of identifying near-identical files on a binary level. I remember my excitement at the time, and today I still often use this technique when sifting through the clutter.
Thanks to advancements in the Data Mining and Information Retrieval fields, computer forensics professionals now have another tool at their disposal. It is capable of identifying near duplicates and chained near duplicates based on the content of a document, not just its binary representation. The algorithm is called w-shingling, or just shingling. Shingling does not attempt to understand the meaning of the text like some Natural Language Processing algorithms do, so it is language agnostic. It also works with the content of a document regardless of formatting (.pdf, .dot, .eml, etc.).
The algorithm may have a significant impact on the investigator's workflow because:
a) it is very effective at reducing the number of irrelevant files, and
b) it is great at finding documents that are similar to files already identified as being relevant.
Sometimes we hear about admissibility problems caused by the use of black box algorithms in eDiscovery, most often associated with another cool technology, "predictive coding". Well, shingling is not part of this legal battle, and the algorithm is not a "black box". In fact, most of us are already using it on a daily basis when searching the Internet. Variations of the algorithm are used by search engines to ensure that the original content gets returned in response to our keyword query and the duplicates are omitted.
Shingling
Shingling is an effective and efficient method for comparing files containing text. As I previously mentioned, shingling doesn't rely on linguistics.
As Albert Einstein once said, “If you can't explain it to a six year old, you don't understand it yourself.” Taking this literally, the algorithm extracts the plain text from a document and performs the following:
a) removes all characters except letters and digits and converts everything to lower case*
b) splits the text into overlapping groups of words, called shingles (hence the name “shingling”)
c) compares the sets of shingles generated from the documents to check how similar two documents are
Here is an example of w-shingling where w = 2 words:
Text: "Apples Are Not Oranges."
Normalised text: "apples are not oranges"
2-word shingles: {"apples are", "are not", "not oranges"}
Shingle size can vary; however, a w-shingle size of 1 would produce a bag of unordered words, which can be found in many dissimilar documents (a quick demonstration follows). So, essentially, by shingling we capture the relationship between the words. Once we have created shingles, we need a method to compare them.
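To see why a shingle size of 1 falls short, compare two sentences built from the same words in a different order (using the shingles() sketch above):
print(shingles("Apples are not oranges.", w=1))
# {'apples', 'are', 'not', 'oranges'}
print(shingles("Oranges are not apples.", w=1))
# {'oranges', 'are', 'not', 'apples'} - the same unordered set of words
print(shingles("Oranges are not apples.", w=2))
# {'oranges are', 'are not', 'not apples'} - only 'are not' is shared with the first sentence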
Jaccard similarity
Paul Jaccard (1868 - 1944) was a professor of botany and plant physiology. He developed and published the Jaccard index of similarity in 1901. The method was used to calculate the distribution of various plants in the alpine zone. It measures similarity between sample sets. Today the method is used in a wide range of disciplines including biology, genealogy, mathematics and computer science, and we can now add computer forensics to this list.
Let's take two baskets of fruit, basket A and basket B.
Both baskets A and B have a set of common fruit items, called the Intersection (denoted as ∩).
The set of all items in either basket A or B is called the Union (denoted as ∪). When two sets have one or more elements in common, these elements are counted only once in their union set.
The Jaccard similarity index (let's denote it as Sim) is calculated as the number of items in the intersection of A and B divided by the number of items in the union of A and B. The result ranges from 0 (no elements in common) to 1 (identical sets).
\[Sim(A,B) = \frac{|A ∩ B|}{|A ∪ B|}\]
In the above example we have:
\[Sim(A,B) = \frac{|A ∩ B|}{|A ∪ B|} = \frac{3}{10} = 0.3\]
Notes and notation: objects separated by commas inside parentheses, e.g. (A, B, C, D, E), denote an ordered set, i.e. (A, B) ≠ (B, A). If the objects are inside curly brackets, {A, B, C, D, E}, they are unordered, i.e. {A, B} = {B, A}.
In mathematics, a value between | | denotes the absolute value of a real number, which is always non-negative, for example | -3 | = 3. Cardinality, the number of elements in a set, uses the same notation. We mean the latter in our example, where |A ∩ B|, the number of elements in A ∩ B, is divided by |A ∪ B|, the number of elements in A ∪ B.
Here is another example:
Let's consider two sets A and B, where:
A = {0, 1, 2, 3} and B = {1, 2, 3, 4, 5, 6, 7}
$Sim(A,B) = \frac{|\{1, 2, 3\}|}{|\{0, 1, 2, 3, 4, 5, 6, 7\}|} = \frac{3}{8} = 0.375$
Python's built-in set type provides union and intersection methods. Calculation of the Jaccard similarity index can be done in just a few lines:
A = {0, 1, 2, 3}
B = {1, 2, 3, 4, 5, 6, 7}
IntersectionAB = len(A.intersection(B))   # |A ∩ B| = 3
UnionAB = len(A.union(B))                 # |A ∪ B| = 8
SimAB = IntersectionAB / float(UnionAB)   # 3 / 8 = 0.375
The Jaccard distance, a measure of how dissimilar two sets are, is simply $1 - Sim(A,B)$ and can be calculated as:
DistanceAB = (len(A.union(B)) - len(A.intersection(B))) / float(len(A.union(B)))   # (8 - 3) / 8 = 0.625
An implementation of Jaccard distance is also available in the Python Natural Language Toolkit (NLTK) and could be used as follows:
from nltk.metrics.distance import jaccard_distance
A = {0, 1, 2, 3}
B = {1, 2, 3, 4, 5, 6, 7}
DistanceAB = jaccard_distance(A,B)
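Putting the pieces together, a hypothetical end-to-end comparison of two short documents might look like this (reusing the shingles() sketch from earlier):
A = shingles("Apples are not oranges.")   # {'apples are', 'are not', 'not oranges'}
B = shingles("Apples are not pears.")     # {'apples are', 'are not', 'not pears'}
SimAB = len(A & B) / float(len(A | B))    # |A ∩ B| = 2, |A ∪ B| = 4, so 0.5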
Jaccard similarity is a simple yet effective method, but not without limitations:
- It doesn't account for the frequency of query term occurrence
- It treats all shingles equally, although unique words or combinations of words are more informative than commonly used ones
- Longer documents produce more hits compared to short documents. I have seen a modified formula $Sim(A,B) = \frac{|A ∩ B|}{\sqrt{|A ∪ B|}}$ being used for length normalisation (see the sketch after this list)
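Here is a quick sketch of that length-normalised variant (my own translation of the formula into Python, not a standard library function); note that the result is no longer bounded by 1:
from math import sqrt

def normalised_sim(A, B):
    # |A ∩ B| / sqrt(|A ∪ B|) dampens the advantage of longer documents
    return len(A & B) / sqrt(len(A | B))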
The shingle size should be more than 1 word and should take into account the length of the document and, ideally, the amount of commonly used words in it. It should be selected to be sufficiently large to ensure a low probability of an accidental collision in the selected document(s) [a collision here means accidental similarity].
* Other considerations when implementing shingling are capitalisation, white spaces, punctuation, stop words, and using natural language processing to identify nouns, verbs, synonyms, etc.
When dealing with large sets of data, storing the generated shingles becomes a challenge. To reduce their size, a hash function can be used instead of the actual strings. Obviously, we would need something more suitable than MD5; in fact, we want an algorithm that is functionally the opposite of MD5 and closer in outcome to "fuzzy hashing". "The min-wise independent permutations algorithm", or MinHash for short, is used for this purpose. I am going to leave MinHash for some other time, as it isn't necessary for understanding the algorithm of finding near-duplicate documents.
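Without getting into MinHash itself, the storage idea can be sketched by replacing each shingle string with a fixed-size integer (the helper below is my own illustration using Python's hashlib, not MinHash proper):
import hashlib

def hashed_shingles(shingle_set):
    # keep a compact 64-bit integer per shingle instead of the full string
    return {int(hashlib.sha1(s.encode("utf-8")).hexdigest()[:16], 16)
            for s in shingle_set}

Jaccard similarity computed over the hashed sets gives the same result as over the original strings, barring hash collisions.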
Currently, most commercial computer forensic tools are missing this feature, though I can see some snippets of the functionality being implemented in the latest FTK Lab edition. By expanding its functionality, NUIX 4.2 has become a new player in the Computer Forensics field and brought a new set of useful tools over from eDiscovery. My favourites so far are:
- solid support for various email formats (NSF, PST, EDB, etc.);
- fast and reliable indexing;
- and of course shingling: the ability to export shingle lists for use across cases, similarly to NSRL or MD5 hash sets, and to identify near-duplicates as well as chained near-duplicates.
[Image: Chained near-duplicates]
UPDATE: 1 October 2014
We do a lot of cool stuff at work with this technology in the eDiscovery, Computer Forensics and Incident Response space. With NUIX 6 just released, shingling can now be used for log analysis (evtx, IIS, Apache, etc.). It is a game changer in Incident Response... and hey, I now run NUIX on my MacBook Air natively.