Cloud Computing, Web Mining, and Business Intelligence


© 2005



Cloud Computing, Web Mining, and
Business Intelligence



Hsinchun Chen, Ph.D., Director, Artificial Intelligence Lab

Director, NSF COPLINK and Dark Web Research Centers

University of Arizona


Acknowledgements: NSF, DOD, LOC, DHS, DOJ, FBI, CIA



Cloud Computing Overview






Cloud computing: applications, system software, and
hardware delivered as services over the Internet.



Service-oriented architecture + virtualization + utility
computing



Software as a Service (SaaS), Infrastructure as a
Service (IaaS), Platform as a Service (PaaS)



Web services, data centers



Public cloud, private cloud, community/hybrid cloud



Major public cloud service providers: Amazon, Google,
Salesforce; IBM, Microsoft, HP, VMware



What is a Real Cloud?




Hardware: 10,000+ servers, exabytes (EB) of storage, virtualization,
green computing



Software (as Services and Development Platforms):
storage management, DBMS, collaboration tool, web
servers, etc.



Web Services and APIs: SOAP, REST, RSS, JSON,
AJAX, Java, Python, http, SMTP, Apache server,
BigTable, etc.



Utility Metering: account management, flexibility, cost
effectiveness, security, compliance



Parallelization and Grid Computing: 1,000+ processors,
MapReduce, Hadoop, etc.



SCALE, FLEXIBILITY, WEB SERVICES, PARALLELIZATION



Cloud Computing Features




Pros: Agility, cost, device/location independence, multi-tenancy,
reliability, scalability, security, maintenance, metering




Cons: Privacy, security, compliance, availability,
performance, sustainability


We are all in the cloud with our daily email activities:
Google Gmail (191M users), Yahoo! Mail (280M),
Microsoft Live Hotmail (360M)!!!





Cloud Computing Industry/Academic
Platforms




Google + IBM: Hadoop; Apache open source, inspired
by Google’s MapReduce and Google File System (GFS)




HP + Intel + Yahoo: Hadoop; housed at UIUC




Microsoft: Azure; Live Services, SQL Azure (formerly
SQL Services), AppFabric (formerly .NET Services),
SharePoint Services, and Dynamics CRM Services




UIUC, TeraGrid at Purdue U, Gordon system at SDSC,
FutureGrid at Indiana U (grid and data-intensive computing)




Cloud Computing in Developing
Economies




IBM has established cloud computing centers in China,
India, Vietnam, Brazil, and South Korea




E-Education: IBM Cloud Academy, virtual computing
labs, research support (data & computing)

E-Health: IBM Healthcare Industry Solution Lab in
Beijing, healthcare data sharing and analytics, EMR,
personalized medicine and insurance, health 2.0

E-Commerce: global supply-chain management,
banking, telecommunications, IT hosting, business
intelligence and analytics

E-Government: government data sources, services




Cloud Computing Opportunities




Access to IT infrastructure, applications, and data




Productivity and flexibility gains




Cloud-related entrepreneurial activities




Building applications on established cloud platforms




Developing new products and services




Data-intensive business intelligence and analytics


Major Commercial Cloud Computing
Platforms


Amazon Elastic Compute Cloud (EC2): allows users to
rent virtual computers on which to run their own
applications. In addition, Simple Storage Service (S3)
provides an online storage web service.



Google App Engine: a platform for developing and
hosting Java- or Python-based web applications. Google
Bigtable is used for backend data storage.



Windows Azure: a platform that provides services such as
SQL Azure and SharePoint Services, and allows .NET
Framework applications to run on the platform.

11/3/2013


Hadoop Architecture: MapReduce

(programming simplicity)
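
The programming model named on this slide can be illustrated with a word count in plain Python. This is only a single-process sketch of the map, shuffle, and reduce steps, not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster each document could be mapped on a different node;
    # this sketch simply loops over them in one process.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["the cloud", "the grid and the cloud"])
```

The programmer writes only the map and reduce functions; distribution, grouping, and fault handling are left to the framework, which is the "programming simplicity" the slide title refers to.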


Hadoop Architecture: NameNode & DataNode

(power of parallelization)


Nutch Search Engine Architecture (on Hadoop)


AI Lab Hadoop Cluster


We have set up a Hadoop cluster in the E-commerce Lab with the
following configuration:

Host hadoop-master: the node that acts as both JobTracker and NameNode.

Host hadoop-slave-ipxxx: the 13 slave nodes that all act as TaskTracker and DataNode.

All 14 nodes are physically identical, with the following
specification:

CPU: AMD Opteron 246, 2 GHz, 2 core

Memory: 3 GB

This cluster suffices for our initial testing. If we extend the
cluster to a larger scale, we may need two more powerful servers
to act as NameNode and JobTracker respectively.
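
To make the NameNode/DataNode split concrete, the sketch below shows the kind of bookkeeping a NameNode performs: a file is cut into fixed-size blocks and each block is stored on several DataNodes. It is a simplified illustration, not HDFS code. The 64 MB block size and replication factor of 3 are assumed defaults, the round-robin placement stands in for HDFS's real rack-aware policy, and the host names are hypothetical.

```python
import math


def place_blocks(file_size_mb, datanodes, block_mb=64, replication=3):
    """Return {block_index: [datanode, ...]} using round-robin placement.

    block_mb and replication are assumed values; real HDFS placement is
    rack-aware and more involved than this.
    """
    num_blocks = math.ceil(file_size_mb / block_mb)
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


# Hypothetical names for the 13 slave nodes described above.
datanodes = [f"hadoop-slave-{i:02d}" for i in range(1, 14)]
table = place_blocks(1000, datanodes)  # a 1,000 MB file
```

Because every block lives on several nodes, map tasks can run where the data already is, and the loss of one DataNode does not lose the file.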



Web Services Overview



What Are Web Services?


Web Services:


A new way to reuse and integrate third-party software or legacy
systems


Works no matter where the software is, what platform it runs on,
or which language it was written in


Based on XML and Internet protocols (HTTP, SMTP)


Benefits:


Ease of integration


Develop applications faster



Web Services Architecture


Simple Object Access Protocol (SOAP)


Web Services Description Language (WSDL)


Universal Description, Discovery and Integration
(UDDI)
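
As an illustration of the first item, a SOAP request is an XML envelope with a body that names the operation and its parameters. The sketch below builds such an envelope with Python's standard library. The envelope namespace is the SOAP 1.1 one; the service namespace, the operation name SearchCatalog, and the parameter Keyword are hypothetical and do not describe any real service.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "http://example.com/catalog"  # hypothetical service namespace


def build_soap_request(operation, params):
    """Wrap one operation call and its parameters in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical operation and parameter, for illustration only.
request_xml = build_soap_request("SearchCatalog", {"Keyword": "hadoop"})
```

In practice the operation names and parameter types would come from the service's WSDL document rather than being typed by hand.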


New Breeds of Web Services


Representational State Transfer (REST)


Uses the HTTP GET method to invoke remote services (the request itself is not XML)


The response from the remote service can be XML or any other textual format


Benefits: Easy to develop; Easy to debug; Leverage existing web
application infrastructure


Really Simple Syndication (RSS, Atom)


XML-based standard


Designed for news-oriented websites to “push” content to readers


Well suited to monitoring new content from websites


JavaScript Object Notation (JSON)


Lightweight data-interchange format


Human readable and writable and also machine friendly


Wide support from most languages (Java, C, C#, PHP, Ruby, Python…)
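
The three styles above can be tried with the Python standard library alone. The sketch below builds a REST-style GET URL, parses a JSON response, and pulls item titles out of an RSS feed. The endpoint http://api.example.com/search, its parameters, and both sample payloads are made up for illustration; no network call is made.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_rest_url(base_url, **params):
    """A REST call is an ordinary HTTP GET: parameters go in the query string."""
    return f"{base_url}?{urlencode(params)}"


def json_titles(response_text):
    """Parse a JSON response and return the title of each result."""
    return [result["title"] for result in json.loads(response_text)["results"]]


def rss_titles(rss_xml):
    """Return the title of every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("title") for item in root.iter("item")]


# Hypothetical endpoint and sample payloads.
url = build_rest_url("http://api.example.com/search", q="cloud computing", format="json")
sample_json = '{"results": [{"title": "Hadoop"}, {"title": "EC2"}]}'
sample_rss = (
    '<rss version="2.0"><channel><title>News</title>'
    "<item><title>A</title></item><item><title>B</title></item>"
    "</channel></rss>"
)
```

The same query could return either payload format; the client only needs the matching parser, which is why these services are easy to develop against and to debug in a browser.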



Rich Interactivity Web - AJAX


AJAX: Asynchronous JavaScript + XML


AJAX incorporates:


standards-based presentation using XHTML and CSS;

dynamic display and interaction using the Document Object Model;

data interchange and manipulation using XML and XSLT;

asynchronous data retrieval using XMLHttpRequest;

and JavaScript binding everything together.



Examples:


http://www.gmail.com


http://www.kiko.com



More info: http://www.adaptivepath.com/publications/essays/archives/000385.php


AJAX Application Model


Amazon Web Services (AWS)


Amazon E-Commerce Service


Search catalog, retrieve product information, images and customer reviews


Retrieve wish list, wedding registry…


Search seller and offer


Alexa Services


Retrieve information such as site rank, traffic rank, thumbnail, and related sites,
among others, given a target URL


Amazon Historical Pricing


Programmatic access to over three years of actual sales data


Amazon Simple Queue and Storage Service


A distributed resource manager to store web services results


Amazon Elastic Compute Cloud (EC2)


Sells computing capacity, billed by the amount you use



Google Web APIs


Google has a long list of APIs:
http://code.google.com/apis/


Google Search


AJAX Search API


SOAP Search API (deprecated)


Custom search engine with Google Co-op


Google Maps API


Google Data API (GData)


Blogger, Google Base, Calendar, Gmail, Spreadsheets, and a lot
more


Google Talk XMPP for communication and IM


Google Translate


Google App Engine


eBay API


Buyers:


Get the current list of eBay categories


View information about items listed on eBay


Display eBay listings on other sites


Leave feedback about other users at the conclusion of a
commerce transaction


Sellers:


Submit items for listing on eBay


Get high bidder information for items you are selling


Retrieve lists of items a particular user is currently selling
through eBay


Retrieve lists of items a particular user has bid on


Other Service/API Providers


Yahoo!
http://developer.yahoo.com/


Search (web, news, video, audio, image…)


Flickr, del.icio.us, MyWeb, Answers API


Windows Live Services
http://msdn2.microsoft.com/en-us/live/default.aspx


Search (SOAP, REST)


Spaces (blog), Virtual Earth, Live ID


Wikipedia


Downloadable database

http://en.wikipedia.org/wiki/Wikipedia:Technical_FAQ#Is_it_possible_to_download_the_contents_of_Wikipedia.3F


Many more at Programmableweb.com


http://www.programmableweb.com/apis


Services by Category


Search


Google, MSN, Yahoo


E-Commerce


Amazon, eBay, Google Checkout


TechBargain, DealSea, FatWallet


Mapping


Google, Yahoo!, Microsoft


Community


Blogger, MySpace, MyWeb


del.icio.us, StumbleUpon


Photo/ Video


YouTube, Google Video, Flickr


Identity/ Authentication


Microsoft, Google, Yahoo


News


Various news feed websites including Reuters, Yahoo! and many more.




Mashup

A Novel Form of Web Reuse



“A mashup is a website or application that combines
content from more than one source into an integrated
experience.” (Wikipedia)


API X + API Y = mashup Z


Business model: Advertisement
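
The "API X + API Y = mashup Z" formula amounts to joining records from two sources on a shared key. The sketch below merges listings from one source with coordinates from another so they could be drawn on a map. Both data sets, the field names, and the function mashup are invented for illustration and stand in for real API responses.

```python
def mashup(listings, coordinates):
    """Join records from source X with locations from source Y on the city name."""
    merged = []
    for item in listings:
        location = coordinates.get(item["city"])
        if location is not None:  # skip listings the second source cannot place
            merged.append({**item, "lat": location[0], "lon": location[1]})
    return merged


# Hypothetical responses from two independent services.
listings_from_x = [
    {"title": "Used couch", "city": "Tucson"},
    {"title": "Ski boots", "city": "Flagstaff"},
    {"title": "Mystery box", "city": "Nowhere"},
]
coordinates_from_y = {"Tucson": (32.22, -110.97), "Flagstaff": (35.20, -111.65)}

map_points = mashup(listings_from_x, coordinates_from_y)
```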



Web Services and Business
Opportunities (With or Without Cloud)




50 Business Opportunities

(“Business Web Mining Using Amazon, Google, eBay, and Google”)




Retailing and e-Services:

iRelocate

RealTomatoes

SmallBH

HobbyCentral

NewPlaceSeek

College Advisor

Friendly Gifter

Clipper

GottaCouch

SkiStop

vTrack

Barter Bay

Link-US

Smart Gift Card

Timely Bid

Tucson Gamer Café

TV and More
Deliverables

Cellphone Intelligent Auctioning

Tucson Book Exchange

SciBubble

Wish Sky

GiftChannel

PriceSmart

WetYourWhistle




Sports and Entertainment:

BetSmart

XTREME F1

MLB

100Yards

CricWeb


iBollywood

Sa Ri Ga Ma

WOW

Bollywood

Funzic

HinduShrines


Indiapaaru

NachBaliye

Movie Location Quest

Remakes

SugarSuite


MusicBox

Artist Connection

Concerto

Star Search





Government and Social Works:

RepCheck

SmallNGreenCars

Change of Base

iDog

Tasty Park

iSupport




Sa Ri Ga Ma




Mahalakshmi Sundararajan, Pavithra Ravi, Sahana
Nagaraja



Carnatic Music: One of the two main genres of Indian
classical music; Mostly performed vocally



Sarigama.com: one-stop information portal for Carnatic
music




Sa Ri Ga Ma




Sarigama.com latest news and RSS Feeds



Artist information



Transliteration



Music play and video



Shopping



Lessons and Library



Concert locator



Forums



Interactive Features



Tag Clouds



Lyrics Recommender system



SmallNGreenCars




Kumar Vakeel, Kunal Jain, Neeraj Munshi



SmallNGreenCars.com will provide information,
recommendation and purchasing options to the users.



This website is our way of helping Earth become a
better place to live.




Unique Concept



Global customers



YouTube vehicle videos



Flickr vehicle photos



Google Maps and Local Search



Google visualization



RSS feeds of global vehicle news



Facebook recommendation from friends



Yahoo Finance for currency exchange



Google Translate for web pages



Recommendation System



Fuel Efficiency Challenge





BetSmart




Cai Chen, Ximing Yu, Ying Jin



BetSmart focuses on guiding betting on Italy's Serie A,
one of the top soccer leagues in the world.



The website provides all the information needed for
following games and placing bets, including news, club
and player information, match statistics, and predictions
for upcoming matches.





Soccer news



Club and player information



Match statistics



Match predictions



BetExplorer betting information



YouTube soccer videos



Flickr player photos



Google Translate for news translation



Google Visualization



Betting prediction with Weka



Google App Engine





Google App Engine (GAE) lies in the Google Cloud



Current free quota is 500 MB and 5M page views
per month



Java and Python runtime



Java Persistence API (JPA)



Google Bigtable



Task queues (executed in parallel)



URL fetch, Memcache



Google accounts



Flexible and configurable



No DBMS support




RealTomatoes




Merry Rusli, Shanna Leonard, Devipsita Bhattacharya



Consumer demand for organic and local food has
been increasing rapidly in the US. In the wake of food
scares such as salmonella in fresh spinach and
melamine-contaminated cookies, more and more US customers
want to know where their food comes from and to be sure
that the food they eat is local, fresh, and organic.



RealTomatoes aims to become a "one-stop-shop" for
information on organic and locally produced food, to
connect people with the best local sources for healthy,
delicious eating, whether dining out or cooking at home.



RealTomatoes




One-stop site for organic, local, vegetarian,
and healthy food

Organic and healthy restaurant, farm, farmers’
market, and CSA listings with map display

Localized farm/CSA search based on IP geolocation

Geo-local calendar of farmers’ markets and events

Customer education and information

Customer-to-customer and business-to-customer
social networking

Restaurant recommendation to user

Amazon Cloud allows rapid expansion of capacity

Amazon EC2 cloud utilizing LAMP (Linux, Apache,
MySQL, and PHP) stack
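
A localized search of the kind listed above can be reduced to sorting listings by distance from the visitor. The sketch below assumes the visitor's IP address has already been resolved to a latitude and longitude by some geolocation service (that lookup is not shown) and ranks listings with the haversine great-circle formula. The listing names and coordinates are invented; the slides do not say how the site actually computes distance.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points on Earth."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest(user_location, places, limit=3):
    """Return up to `limit` places, closest to the user first."""
    lat, lon = user_location
    ranked = sorted(places, key=lambda p: haversine_km(lat, lon, p["lat"], p["lon"]))
    return ranked[:limit]


# Hypothetical listings; the user location would come from IP geolocation.
farms = [
    {"name": "Phoenix CSA", "lat": 33.45, "lon": -112.07},
    {"name": "Tucson Farm Stand", "lat": 32.25, "lon": -110.95},
    {"name": "Los Angeles Market", "lat": 34.05, "lon": -118.24},
]
closest = nearest((32.22, -110.97), farms, limit=2)
```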




RealTomatoes


[Architecture diagram. Components: RealTomatoes Search; Database; Data Integration Tool; XML Parsing Tools; RSS Feed; Vegguide.org; Web Spider; Websites; YQL (Yahoo Query Language); Yahoo Local Search; Alchemy HTML API]



Business Intelligence and Analytics:

BizCloud Opportunities




BIG and FAST




Business data from TBs to PBs



The Big Data Era; “speed to insight”



Barnes & Noble + Aster Data + MapReduce; McAfee +
Datameer + Hadoop; AdKnowledge + Greenplum +
Hadoop + Amazon EC2




MapReduce/Hadoop vs. parallel DBMSs: ETL and
“read once” data sets; complex and powerful analytics;
semi-structured data; quick-and-dirty analyses; limited-budget
operations; fault tolerance; performance




Business Intelligence and Analytics




$3B BI revenue in 2009 (Gartner, 2006)



The Data Deluge (The Economist, March 2010);
internet traffic 667 exabytes by 2013 (Cisco); total amount
of information in 2010: 1.2 zettabytes
(KB-MB-GB-TB-PB-EB-ZB-YB)



$9.4B BI software spending in 2010 and $14.1B by
2014 (Forrester)



IBM spent $14B in BI in five years; $9B BI revenue in
2010 (USA Today, November 2010); 24 acquisitions,
10,000 BI software developers, 8,000 BI consultants, 200
BI mathematicians
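
The unit ladder quoted above (KB through YB) is easier to compare once the figures are put on one scale. The short sketch below converts between units, assuming decimal (SI) prefixes where each step is a factor of 1,000; the function names are illustrative.

```python
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert to bytes, assuming decimal prefixes (1 KB = 1,000 bytes)."""
    return value * 1000 ** (UNITS.index(unit) + 1)


def convert(value, from_unit, to_unit):
    """Convert between two units on the KB..YB ladder."""
    return value * 1000 ** (UNITS.index(from_unit) - UNITS.index(to_unit))


traffic_in_zb = convert(667, "EB", "ZB")  # the 667 EB traffic figure, in ZB
info_in_eb = convert(1.2, "ZB", "EB")     # the 1.2 ZB information figure, in EB
```

On this scale the 667 EB traffic figure is roughly two thirds of a zettabyte, a little over half of the 1.2 ZB information estimate.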




Business Intelligence and Analytics




BI: “skills, technologies, applications, and practices used
to help an enterprise better understand its business and
market.”



Technologies: data warehousing; Extraction,
Transformation, and Load (ETL); Business Performance
Management (BPM); visual dashboards; and advanced
knowledge discovery using data and text mining



BI 2.0: web intelligence, web analytics, web 2.0, social
media analytics, opinion mining; cloud computing and
web services; real-time monitoring and mining; enterprise
performances (marketing/accounting/finance/healthcare)


CS Ecosystem and Impacts


Data, Text, and Web Mining


Data Mining: ID3, neural networks, genetic
algorithms, SVM; Weka, SPSS, SAS, Microsoft
SQL Server data mining, IBM Intelligent Miner,
IBM Cognos



World Wide Web: FTP, HTTP/HTML, browser, digital
library, search engines; Mosaic, AltaVista, Lycos,
Yahoo, Google



Social Media: collaboration, participation, filtering,
multimedia, social networks; Facebook, YouTube,
Twitter, Second Life
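
ID3, the first data mining method listed above, builds a decision tree by repeatedly splitting on the attribute with the highest information gain. The sketch below computes entropy and information gain and picks the first split; it does not build the full tree. The tiny match data set and its attribute names are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting rows on `attribute`."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attributes, target):
    """The attribute ID3 would split on first: the one with the highest gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical training data: does the team win?
matches = [
    {"home": "yes", "rain": "no", "win": "yes"},
    {"home": "yes", "rain": "yes", "win": "yes"},
    {"home": "no", "rain": "no", "win": "no"},
    {"home": "no", "rain": "yes", "win": "no"},
]
first_split = best_attribute(matches, ["home", "rain"], "win")
```

Here "home" separates wins from losses perfectly (gain of 1 bit) while "rain" tells nothing (gain of 0), so the tree would split on "home" first.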


Data, Text, and Web Mining


Data Mining Models and Methods


Data Mining: A KDD Process


Hsinchun Chen, MIS Dept.

hchen@eller.arizona.edu …


Slides and Project information
available at:

http://ai.arizona.edu