
Face and Fingerprint

Identification System

Use Case of Hadoop Framework

By Dedy Sutanto

Disclaimer


This presentation focuses on how to use Hadoop to attack a problem


This presentation is mainly for learning purposes


This presentation won’t cover image recognition algorithms


This presentation is just a hobby project of mine



Face and Fingerprint Identification
System


Hadoop Framework


Benefits and Conclusions

Face and Fingerprint
Identification System


The system will store a photo and fingerprints of each individual


At least 3 images per individual, for example:


1 image of the face


2 images of fingerprints (left and right thumbs)


The system needs to search and match based on an input


The input is an image of a face or a thumb fingerprint


One input type per task


The system’s output is a list of possible identification matches (matching face/fingerprint)


Growing system: start small, then grow as required


Assumptions


Assumption #1: 500 KB per image


Assumption #2: the system needs to handle up to 100 million individuals


Assumption #3: searching and matching takes 100 ms per image file



The Challenges


Storage


Each individual requires 3 images: 3 x 500 KB = 1,500 KB per individual


Storage required: 1,500 KB x 100,000,000 = 150,000,000,000 KB = 150 terabytes


Processing time


With 100 million image files to scan and a processing time of 100 ms per image, sequential processing to match one input image would require: 100 ms x 100,000,000 = 10,000,000,000 ms ≈ 2,777 hours!
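
A quick back-of-the-envelope check of the two numbers above (not part of the original deck), written in Java and using only the three assumptions stated earlier:

```java
// Sanity check of the storage and sequential-processing estimates,
// using only the deck's three assumptions.
public class CapacityEstimate {
    public static void main(String[] args) {
        long individuals = 100_000_000L;   // assumption #2: 100 million individuals
        long imageKB = 500L;               // assumption #1: 500 KB per image
        long imagesPerPerson = 3L;         // 1 face + 2 thumb fingerprints
        long matchMs = 100L;               // assumption #3: 100 ms per image comparison

        long storageKB = individuals * imagesPerPerson * imageKB;
        System.out.println("Storage: " + storageKB / 1_000_000_000L + " TB");          // 150 TB

        // One matching task scans one image of the queried type per individual.
        long sequentialMs = individuals * matchMs;
        System.out.println("Sequential scan: " + sequentialMs / 3_600_000L + " hours"); // 2777 hours
    }
}
```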


Possible Solutions


SAN


This leads to a sequential processing method with a long processing time per matching task (~2,777 hours)


Database


I am not aware of a database that handles 150 TB of data well


We would still need to work out a storage strategy for 150 TB


Hadoop


In this presentation, we will see how to use Hadoop to attack the problem




Face and Fingerprint Identification
System


Hadoop Framework


Benefits and Conclusions

Apache Hadoop


From http://hadoop.apache.org

“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

What Hadoop Is Good At


Text mining


Search Quality


Graph creation and analysis


Pattern Recognition


Collaborative filtering


Prediction models


Sentiment analysis


Risk assessment


Powered by Hadoop


Yahoo!


More than 100,000 CPUs in >40,000 computers running Hadoop


Biggest cluster: 4,500 nodes (2 x 4-CPU boxes with 4 x 1 TB disks and 16 GB RAM)


Used to support research for Ad Systems and Web Search


Also used to do scaling tests to support development of Hadoop on larger clusters


Hadoop Korean User Group


A Korean local community team page


50-node cluster in the Korea university network environment


Pentium 4 PCs, 4 TB of HDFS storage


Used for development projects


Retrieving and Analyzing Biomedical Knowledge


Latent Semantic Analysis, Collaborative Filtering


IBM


Blue Cloud Computing Clusters


University Initiative to Address Internet-Scale Computing Challenges

Typical Hadoop Cluster


Components


NameNode


JobTracker


TaskTracker


DataNode
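
To make concrete how one matching task could run on such a cluster (the JobTracker schedules the work and the TaskTrackers on each DataNode execute it against locally stored HDFS blocks), below is a minimal MapReduce sketch. It is not part of the original deck and not the author's implementation: it assumes the image corpus is stored in HDFS as SequenceFiles of (person ID, image bytes), that the query image is shipped to the mappers Base64-encoded under a hypothetical job property, and that matchScore() and the 0.8 threshold are placeholders, since the deck deliberately leaves the recognition algorithm out of scope.

```java
import java.io.IOException;
import java.util.Base64;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatchJob {

    // Mapper: compare every stored image against the query image and emit a score.
    public static class MatchMapper extends Mapper<Text, BytesWritable, Text, DoubleWritable> {
        private byte[] query;

        @Override
        protected void setup(Context context) {
            // "match.query.image" is a hypothetical property carrying the Base64-encoded query image.
            String encoded = context.getConfiguration().get("match.query.image", "");
            query = Base64.getDecoder().decode(encoded);
        }

        @Override
        protected void map(Text personId, BytesWritable image, Context context)
                throws IOException, InterruptedException {
            double score = matchScore(query, image.copyBytes()); // placeholder comparison
            if (score > 0.8) {                                    // assumed acceptance threshold
                context.write(personId, new DoubleWritable(score));
            }
        }

        private double matchScore(byte[] a, byte[] b) {
            return 0.0; // placeholder: real face/fingerprint matching is out of scope in this deck
        }
    }

    // Reducer: keep the best score per person ID.
    public static class BestScoreReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text personId, Iterable<DoubleWritable> scores, Context context)
                throws IOException, InterruptedException {
            double best = 0.0;
            for (DoubleWritable s : scores) {
                best = Math.max(best, s.get());
            }
            context.write(personId, new DoubleWritable(best));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "face-fingerprint-match");
        job.setJarByClass(MatchJob.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(MatchMapper.class);
        job.setReducerClass(BestScoreReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS directory of image SequenceFiles
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS directory for candidate matches
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because each map task reads the HDFS blocks stored on its own DataNode, the comparisons run in parallel across the cluster instead of sequentially, which is what the node sizing on the next slides relies on.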


Data Nodes


Total requirement: 150 TB


4 x 1 TB = 4 TB per node


150 TB / 4 TB = 37.5 nodes


With 3x replication


37.5 x 3 = 112.5 nodes, rounded up to ~120 nodes


With 20 nodes per rack (1U servers)


120 / 20 = 6 racks for the final solution
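
The same sizing arithmetic as a small sketch (again not from the original deck), assuming 4 x 1 TB disks per DataNode, 3x HDFS block replication, and 20 nodes per rack:

```java
// DataNode sizing, matching the figures on this slide.
public class ClusterSizing {
    public static void main(String[] args) {
        double requiredTB = 150.0;   // raw data volume from the storage estimate
        double tbPerNode = 4.0;      // 4 x 1 TB disks per DataNode
        int replication = 3;         // HDFS block replication factor (dfs.replication)
        int nodesPerRack = 20;       // 1U servers per rack

        double rawNodes = requiredTB / tbPerNode * replication;              // 112.5
        int nodes = (int) Math.ceil(rawNodes / nodesPerRack) * nodesPerRack; // 120, rounded up to full racks
        int racks = nodes / nodesPerRack;                                    // 6
        System.out.println(rawNodes + " -> " + nodes + " nodes in " + racks + " racks");
    }
}
```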

Processing Times


Rough calculation:


With the matching process distributed across 120 nodes, the processing time is reduced to: 2,777 / 120 ≈ 23 hours, i.e. roughly 24 hours per task


Speedup in processing time: ~116x (2,777 hours vs. ~24 hours)
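
And the corresponding speedup estimate, an idealized figure (not in the original deck) that ignores job scheduling and I/O overhead:

```java
// Rough parallel-processing estimate once the scan is spread over 120 DataNodes.
public class SpeedupEstimate {
    public static void main(String[] args) {
        double sequentialHours = 2777.0;                  // from the earlier challenge slide
        int nodes = 120;
        double parallelHours = sequentialHours / nodes;   // ~23.1, rounded up to ~24 hours per task
        double speedup = sequentialHours / 24.0;          // ~116x, the figure quoted above
        System.out.printf("%.1f hours/task, speedup ~%.0fx%n", parallelHours, speedup);
    }
}
```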

Linear Scalability


The system can start with a small footprint (1 rack) of 10 nodes


Simply add DataNodes when capacity expansion is required



Hardware Cost


Commodity hardware for a DataNode:


8 cores (2 x quad-core CPUs)


16 GB memory


4 x 1 TB disks


Gigabit Ethernet


Simple hardware cost calculation:


Assumed price per server node: ~USD 3,000


Total: 3,000 x 120 nodes = USD 360K



Face and Fingerprint Identification
System


Hadoop Framework


Benefits and Conclusions

Benefits


Parallel and distributed computation


Storage redundancy and replication


Growing system (start small)


Commodity hardware


Further optimization will improve the processing
time

Conclusions


The Hadoop framework offers parallel computing over huge amounts of data


The use of commodity hardware makes the entry point for a Hadoop cluster relatively inexpensive


A Hadoop cluster is linearly scalable; it is easy to expand by just adding new nodes


Disclaimer: this presentation focuses on how to use Hadoop to attack a problem



Thank You (Terima Kasih)