Face and Fingerprint


Nov 17, 2013


Face and Fingerprint

Identification System

Use Case of Hadoop Framework

By Dedy Sutanto


This presentation focuses on how to use
Hadoop to attack a problem

This presentation is for learning purposes

This presentation won’t present the algorithm for
image recognition

This presentation is just a hobby for me

Face and Fingerprint Identification

Hadoop Framework

Benefits and Conclusion

Face and Fingerprint
Identification System

The system will store a photo and fingerprints of each individual

At least 3 images per individual (example)

1 image of the face

2 images of fingerprints (left and right thumb)

The system needs to search and match based on the input

Input is an image of a face or a thumb fingerprint

One type of input per task

The system’s output is a possible identification match (match

Growing system. Start small, then grow as required


Assumption #1: Each image = 500KB

Assumption #2: The system needs to handle up to 100
million individuals

Assumption #3: Processing time for searching and
matching per image file is 100ms

The Challenges


Each individual requires 3 images: 3 x 500KB = 1500KB

Storage required: 1500KB x 100000000 = 150000000000KB
= 150 TeraBytes

Processing time

With 100 million image files and a processing time of 100ms
per image, sequential processing to match one image will
require: 100ms x 100000000 = 10000000000ms ~ 2777 hours
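The two estimates above can be checked with a short script. This is a minimal sketch that uses only the three assumptions stated in this presentation; the constant and function names are illustrative, and decimal units (1 TB = 10^9 KB) are assumed.

```python
# Assumptions taken from the slides; all names here are illustrative only.
IMAGE_SIZE_KB = 500            # Assumption #1
INDIVIDUALS = 100_000_000      # Assumption #2
MATCH_TIME_MS = 100            # Assumption #3
IMAGES_PER_INDIVIDUAL = 3      # 1 face + 2 thumb fingerprints


def storage_tb():
    """Total raw storage in terabytes (decimal units: 1 TB = 10**9 KB)."""
    total_kb = IMAGE_SIZE_KB * IMAGES_PER_INDIVIDUAL * INDIVIDUALS
    return total_kb / 10**9


def sequential_hours():
    """Hours to compare one input against 100 million stored images, one by one."""
    total_ms = MATCH_TIME_MS * INDIVIDUALS
    return total_ms / 1000 / 3600


print(f"Storage: {storage_tb():.0f} TB")
print(f"Sequential match: {sequential_hours():.1f} hours")
```

The sequential figure counts 100 million stored images per matching task, as the slide does.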

Possible Solutions


This will lead to a sequential processing method, which has a
long processing time per matching task ~ 2777 hours


I am not aware of any database holding 150TB of data

We would still need to work out a storage strategy for 150TB


In this presentation, we will see how to use Hadoop to attack the problem

Face and Fingerprint Identification

Hadoop Framework

Benefits and Conclusion

Apache Hadoop

From http://hadoop.apache.org

“The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets
across clusters of computers using a simple programming
model. It is designed to scale up from single servers to
thousands of machines, each offering local computation
and storage. Rather than rely on hardware to deliver high
availability, the library itself is designed to detect and
handle failures at the application layer, so delivering a
highly-available service on top of a cluster of computers,
each of which may be prone to failures.”
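To make the "simple programming model" concrete for this use case, here is a minimal in-process sketch of a map step and a reduce step for matching. It does not use any Hadoop API. The `similarity` function is only a placeholder score, not an image recognition algorithm, and the feature vectors, ids, and function names are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple

Record = Tuple[str, List[float]]   # (individual id, hypothetical feature vector)


def similarity(a: List[float], b: List[float]) -> float:
    """Placeholder score: negative squared distance (higher means closer)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def map_phase(query: List[float], split: Iterable[Record]) -> List[Tuple[float, str]]:
    """Runs once per data split: score every stored record against the query."""
    return [(similarity(query, features), person_id) for person_id, features in split]


def reduce_phase(scored_splits: Iterable[List[Tuple[float, str]]], k: int) -> List[Tuple[float, str]]:
    """Merge the per-split results and keep the k best candidates."""
    merged = (pair for part in scored_splits for pair in part)
    return heapq.nlargest(k, merged)


# Two splits stand in for data blocks stored on two different nodes.
splits = [
    [("id-001", [0.1, 0.9]), ("id-002", [0.8, 0.2])],
    [("id-003", [0.4, 0.4]), ("id-004", [0.79, 0.21])],
]
query = [0.8, 0.2]
best = reduce_phase((map_phase(query, s) for s in splits), k=2)
print([person_id for _, person_id in best])
```

Each map call only touches its own split, which is what lets the work run on the node that already stores the data.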

Hadoop Is Good At

Text mining

Search Quality

Graph creation and analysis

Pattern Recognition

Collaborative filtering

Prediction models

Sentiment analysis

Risk assessment

Powered by Hadoop


More than 100,000 CPUs in >40,000 computers running Hadoop

Our biggest cluster: 4500 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM)

Used to support research for Ad Systems and Web Search

Also used to do scaling tests to support development of Hadoop on larger clusters

Hadoop Korean User Group

a Korean Local Community Team Page.

50-node cluster in the Korea university network environment.

Pentium 4 PC, HDFS 4TB Storage

Used for development projects

Retrieving and Analyzing Biomedical Knowledge

Latent Semantic Analysis, Collaborative Filtering


Blue Cloud Computing Clusters

University Initiative to Address Internet
Scale Computing Challenges

Typical Hadoop Cluster


Name Node



Data Node

Data Nodes

Total requirement 150TB

4 x 1 TB = 4 TB per node

150 / 4 = 37.5 nodes

With 3 times replication

37.5 x 3 = 112.5 nodes ~ 120 nodes

With 20 nodes per rack (1U server)

120 / 20 = 6 racks for the final solution
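The sizing steps above can be reproduced as follows. This is a rough sketch using only the figures on this slide; the function names are illustrative, and it rounds up to whole nodes, whereas the slide rounds further to 120 nodes so that all 6 racks are full.

```python
import math

TOTAL_TB = 150          # raw data, from the storage estimate
DISK_TB_PER_NODE = 4    # 4 x 1 TB per DataNode
REPLICATION = 3         # replication factor used on the slide
NODES_PER_RACK = 20     # 1U servers per rack


def nodes_needed(total_tb, disk_tb_per_node, replication):
    """Minimum number of whole DataNodes needed to hold every replica."""
    return math.ceil(total_tb * replication / disk_tb_per_node)


def racks_needed(nodes, nodes_per_rack):
    """Number of racks needed to house the given node count."""
    return math.ceil(nodes / nodes_per_rack)


minimum = nodes_needed(TOTAL_TB, DISK_TB_PER_NODE, REPLICATION)
racks = racks_needed(minimum, NODES_PER_RACK)
print(f"Minimum nodes: {minimum}")
print(f"Racks: {racks} (holding up to {racks * NODES_PER_RACK} nodes)")
```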

Processing Times

Rough calculation:

With the matching process distributed across 120 nodes,
the processing time will be reduced to: 2777 / 120 ~ 24 hours

Speedup in processing time ~ 116 times
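The rough calculation can be written out as a small script. This sketch assumes an ideal case where the work divides evenly across nodes with no scheduling or network overhead; the names are illustrative. Without rounding, the result is about 23 hours and a speedup of 120 times; the slide's figures of 24 hours and 116 times come from rounding the hours up first.

```python
SEQUENTIAL_HOURS = 100 * 100_000_000 / 1000 / 3600   # 100 ms x 100 million images
NODES = 120


def parallel_hours(sequential_hours, nodes):
    """Ideal case: work divides evenly, no scheduling or network overhead."""
    return sequential_hours / nodes


hours = parallel_hours(SEQUENTIAL_HOURS, NODES)
print(f"{hours:.1f} hours per matching task")
print(f"{SEQUENTIAL_HOURS / hours:.0f}x faster than sequential")
```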

Linear Scalability

System can start with a small footprint (1 rack) with 10

Simply add data nodes when capacity expansion is required
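One way to picture the growth path is to compute the node count for several population sizes. This is a minimal sketch built from the earlier assumptions (3 x 500KB per individual, 3 times replication, 4 TB per node, 100ms per image); the function name and the sample population sizes are illustrative. Under these idealized assumptions, the time per matching task stays roughly flat as data and nodes grow together.

```python
import math

KB_PER_INDIVIDUAL = 3 * 500        # three 500KB images
REPLICATION = 3
KB_PER_NODE = 4 * 10**9            # 4 TB per node, decimal units
MATCH_MS = 100


def plan(individuals):
    """Return (nodes, ideal hours per matching task) for a given population."""
    nodes = math.ceil(individuals * KB_PER_INDIVIDUAL * REPLICATION / KB_PER_NODE)
    hours = individuals * MATCH_MS / 1000 / 3600 / nodes
    return nodes, hours


for individuals in (10_000_000, 50_000_000, 100_000_000):
    nodes, hours = plan(individuals)
    print(f"{individuals:>11,} individuals -> {nodes:>3} nodes, ~{hours:.1f} h per task")
```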

Hardware Cost

Commodity hardware for DataNode:

8 cores (2 x Quad Cores)

16 GB Memory

4 x 1 TB disk


Simple hardware cost calculation:

Assumed price per server node ~ USD 3000

Total: 3000 x 120 nodes = USD 360K
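The same calculation as a short script, extended to show the cost of a single fully populated starting rack and the cost per TB of raw data. It covers server nodes only, uses the assumed price above, and the function name is illustrative.

```python
PRICE_PER_NODE_USD = 3000   # assumed price from the slide
NODES_PER_RACK = 20
RAW_DATA_TB = 150           # unreplicated data volume


def cluster_cost(nodes):
    """Hardware cost in USD for the given number of server nodes."""
    return nodes * PRICE_PER_NODE_USD


full = cluster_cost(120)
print(f"Full cluster (120 nodes): USD {full:,}")
print(f"One full rack ({NODES_PER_RACK} nodes): USD {cluster_cost(NODES_PER_RACK):,}")
print(f"Per TB of raw data: USD {full / RAW_DATA_TB:,.0f}")
```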

Face and Fingerprint Identification

Hadoop Framework

Benefits and Conclusions


Parallel and distributed computation

Storage redundancy and replication

Growing system (start small)

Commodity hardware

Further optimization will improve the processing


The Hadoop framework offers parallel computing over
huge amounts of data

Adoption of commodity hardware makes the entry point
for a Hadoop cluster relatively inexpensive

A Hadoop cluster is linearly scalable. Easy to expand by
just adding new nodes

Disclaimer: this presentation focuses on
how to use Hadoop to attack a problem

Thank You