Using Data Mining to Find Terrorists

Herb Edelstein

This column will look
at certain misconceptions about data analysis and data mining, and how those
technologies can be effective tools for investigators.

It was recently reported that a few days after the September 11 attacks, FBI agents visited one of the
largest providers of consumer data. They did so to see if the 9/11 terrorists were in the database and
quickly found five of them. One of the terrorists had been in the country for less than two years, had 30
credit cards and a quarter million dollars' debt with a payment schedule of $9,800 per month.
Mohammed Atta, the ringleader, had also been here less than two years and had 12 addresses under the
names Mohammed Atta, Mohammed J. Atta, J. Atta and others. Surely, the report speculated, with
patterns like these, we can use the databases we presently have to ferret out terrorists in our midst.
Unfortunately, the answer is, "It depends."

There are limits to using these so-called patterns drawn from the agents' observations. We need to ask,
first, how the records were found and, second, if the observed characteristics are indeed repeated patterns or
merely isolated instances. Because I am not privy to any knowledge other than what was published in
the report, my analysis is based on surmise.

More than likely, the FBI started their search with database queries using the suspected terrorists' names
and likely variants. They found the terrorists' records and then noticed the number of credit cards,
addresses and the amount of debt. However, they probably would not have known in advance to look for
these attributes. Furthermore, the terrorists' records probably didn't show that they had been in the
country for only two years; that is knowledge the FBI brought to the search.

We also don't know how easily the observations generalize to other terrorists or how many non-terrorists
have these same attributes. Combing the database for people who have a number of credit cards, big
debts or multiple addresses would undoubtedly yield both criminals (most of whom aren't terrorists) and
perfectly innocent folks.
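
To see why the innocent would dominate such a list, here is a minimal arithmetic sketch in Python. Every number in it is a made-up illustration, not an estimate: the population size, the number of real targets, and the hit and false-alarm rates of the imagined screen are all assumptions, and the function name screening_outcome is hypothetical.

```python
def screening_outcome(population, true_targets, hit_rate, false_alarm_rate):
    """Return (real targets flagged, innocents flagged, share of flagged who are real)."""
    innocents = population - true_targets
    true_hits = true_targets * hit_rate
    false_alarms = innocents * false_alarm_rate
    precision = true_hits / (true_hits + false_alarms)
    return true_hits, false_alarms, precision


# All four inputs are illustrative assumptions, not figures from any report.
hits, alarms, precision = screening_outcome(
    population=200_000_000,
    true_targets=100,
    hit_rate=0.99,
    false_alarm_rate=0.01,
)
print(f"real targets flagged:    {hits:.0f}")
print(f"innocent people flagged: {alarms:.0f}")
print(f"share of flagged who are real targets: {precision:.6f}")
```

Even with a screen this accurate, which no real screen is likely to be, the flagged list under these assumptions holds roughly two million innocent people for every 99 real targets.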

The large number of addresses for Atta may be an even more difficult screening criterion to use,
considering that we don't know the names of unknown terrorists, let alone their aliases. It would be
nearly impossible to conduct an aggregation across the hundreds of millions of individuals in this
database to calculate the number of addresses, especially because all a terrorist would have to do to
defeat such a search is use different aliases.
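
The alias problem can be shown with a small Python sketch. The records, the names, the cutoff of five addresses and the function name addresses_per_name are all hypothetical; the sketch assumes the simplest possible aggregation, which groups on the exact name string.

```python
from collections import defaultdict


def addresses_per_name(records):
    """Count distinct addresses for each exact name string."""
    seen = defaultdict(set)
    for name, address in records:
        seen[name].add(address)
    return {name: len(addresses) for name, addresses in seen.items()}


# Hypothetical records: one person, six addresses, three spellings of the name.
records = [
    ("Pat Q. Example", "1 First St"),
    ("Pat Q. Example", "2 Second St"),
    ("Pat Example", "3 Third St"),
    ("Pat Example", "4 Fourth St"),
    ("P. Example", "5 Fifth St"),
    ("P. Example", "6 Sixth St"),
]

counts = addresses_per_name(records)
CUTOFF = 5  # hypothetical threshold for "too many addresses"
flagged = [name for name, n in counts.items() if n >= CUTOFF]
print(counts)
print(flagged)
```

Each spelling shows only two addresses, so nothing crosses the cutoff even though the same person is behind all six. Linking the spellings first is a record-matching problem in its own right, and it assumes the aliases resemble one another at all.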


We don't have enough known terrorists or a consistent set of behaviors to use data mining to build
predictive models. Thus, it would not be particularly productive to search for a signature.

If we can't inductively find a pattern from the data, perhaps we can just find exceptional behaviors
far enough from the norm to be worth investigating, such as 30 credit cards or 12 addresses. This
problem (called outlier detection) is easy if you're simply searching for something very different on one
dimension. It's much more difficult when you're looking for combinations of attributes whose individual
values are typical, but which taken together are unusual. For example, being male or pregnant is not
unusual, but pregnant males are rather uncommon! It's even more difficult to find outliers in categorical
variables (data that fits in discrete classes) because the way to measure differences is not obvious. For
example, what is the measure of the difference between a Ford and a Chevy?
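
One simple way to look for unusual combinations of categorical values, sketched here in Python, is to compare how often a pair of values actually occurs with how often it would occur if the two attributes were independent. This is only one of many possible approaches, not a standard recipe; the data are synthetic, and the function name rare_combinations and the 0.1 ratio cutoff are assumptions made for illustration.

```python
from collections import Counter


def rare_combinations(records, ratio_cutoff=0.1):
    """Flag (a, b) pairs observed far less often than independence would predict."""
    n = len(records)
    count_a = Counter(a for a, _ in records)
    count_b = Counter(b for _, b in records)
    count_ab = Counter(records)
    flagged = []
    for (a, b), observed in count_ab.items():
        # Expected count if attribute a and attribute b were unrelated.
        expected = count_a[a] * count_b[b] / n
        if observed / expected < ratio_cutoff:
            flagged.append((a, b))
    return sorted(flagged)


# Synthetic data: each value is common on its own, one pairing is not.
records = (
    [("female", "not pregnant")] * 450
    + [("female", "pregnant")] * 50
    + [("male", "not pregnant")] * 499
    + [("male", "pregnant")] * 1
)
print(rare_combinations(records))
```

With two attributes this is easy. With dozens of attributes the number of combinations explodes, most combinations have too few records to judge, and the cutoff becomes a guess, which is part of why this kind of outlier detection is hard in practice.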

Another trap is that if you look at enough variables, sooner or later you'll find at least one that correlates
well with what you are trying to predict. This is called a specification search. When you are searching
through large databases with many attributes, it is easy to find such false predictors. The problem with
relying on data mining or query software as a primary line of defense is that it produces too many false
positives.
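
A short simulation makes the specification-search trap concrete. The Python sketch below generates a random target and many random candidate variables that have nothing to do with it, then reports the strongest correlation found. The sizes (20 cases, 1,000 variables), the seed and the function names are arbitrary assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_correlation(n_cases=20, n_variables=1000, seed=0):
    """Strongest |correlation| between a random target and pure-noise variables."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_cases)]
    best = 0.0
    for _ in range(n_variables):
        candidate = [rng.gauss(0, 1) for _ in range(n_cases)]
        best = max(best, abs(pearson(candidate, target)))
    return best


print(f"best correlation found in pure noise: {best_spurious_correlation():.2f}")
```

Every variable here is noise by construction, yet the best of them will look like a respectable predictor. With few known cases and many attributes, a real database invites exactly this mistake.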

What is the best way to use databases, search technology and data mining? First, recognize that "data" is
more important than "mining." Resources should be
spent working with the existing databases and
setting up new ones that allow investigators to easily share information. Second, humans are more
important than computers. Once trained investigators have generated lists of suspects, it's time to follow
their tracks through the databases to verify information and check whether apparent anomalies are
genuinely unusual and suspicious. Third, while the profiling and prediction aspects of data mining will
be of limited use, other techniques, such as those used for finding fraud, can help investigators
spread their nets beyond the original suspects. For example, visualizations and algorithms have been
used to locate doctors and lawyers who work together to defraud insurance providers. As investigations
help uncover behaviors of terrorists that differentiate them from the rest of us, profiles that trigger
further investigations will emerge.

Thus, we cannot rely on the magic of data mining to find terrorists or protect us from attack. No
shortcuts can substitute for careful investigative work supported by good databases and a management
structure that listens to and supports its investigators.

Herb Edelstein is an internationally recognized expert in data mining, data warehousing and CRM,
consulting to both computer vendors and users. A popular speaker and teacher, he is also a co
