Kenny Trytek, Joe Briggie, Abby Birkett, Derek Woods


3 Nov 2013



Advisor: Simanta Mitra

Client: Matt Good, Kingland Systems

Problem Statement


Large companies have many layers of corporate hierarchy.

Financial and data records sometimes conflict between various layers/entities.

Accurate and comprehensive company records are needed for auditing and stock conflict resolution.

There is a need for “Data Mastering”, to take multiple conflicting sources of data and determine the reality of the matter.

Basic Requirements


System shall autonomously traverse publicly available websites and collect information

System shall store parsed information in a flat file

System shall maintain a normalized database

System shall expose functionality through web services

A single run of the system shall complete execution in less than six hours

Design Decisions


Implementation in C#


ASP.NET GUI with jQuery UI widgets


Operable in a Windows environment (XP or later)

Risks


Site data structures or hierarchies can change at any time


Reliance on a third-party PDF text parser, grid control, and AJAX library


Inconsistencies in data

System Diagram

[Diagram: WWW Data -> Scraper Tool (HTML Parser, PDF Parser) -> Flat File -> ETL Tool -> DAL (Create, Read, Update, Delete) -> Normalized Database. Other labels: Web Svcs., Kingland Data Analyst UI, External Client UI, "No Conflicts?"]

Harvester Module



The harvester performs the work of gathering data from the external sites

After the data is scraped and parsed, the harvester constructs XML files for each data source

Finally, the ETL is notified that the data is ready


[Diagram: World Wide Web -> Scraper -> Parser (PDF Parser, HTML Parser) -> Flat File (XML)]
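The "construct XML files" step can be illustrated with a minimal sketch. The project itself was written in C#; Python is used here only for brevity. The element names (`source`, `entity`), the field names, and the function `build_flat_file` are assumptions for illustration and do not come from the project.

```python
import xml.etree.ElementTree as ET

def build_flat_file(source_name, records):
    """Serialize parsed records from one data source into an XML string."""
    # Hypothetical layout: one <source> root with one <entity> per record.
    root = ET.Element("source", attrib={"name": source_name})
    for record in records:
        entity = ET.SubElement(root, "entity")
        for field, value in record.items():
            # ElementTree escapes &, < and > in text when serializing.
            ET.SubElement(entity, field).text = str(value)
    return ET.tostring(root, encoding="unicode")

xml_text = build_flat_file(
    "example-registry",
    [{"name": "Acme & Sons", "city": "Ames"}],
)
```

Writing one document per data source keeps each harvest run independent, which matches the slide's "XML files for each data source".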

Harvester Difficulties


Constructing a POST request to retrieve the PDFs required extracting a complex view state

Extracting text from the PDFs was difficult

The extracted text was inconsistent: city names were occasionally malformed, and extra formatting characters were present
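The view-state difficulty can be sketched as follows, assuming the target site is an ASP.NET Web Forms page that keeps its state in hidden form fields such as `__VIEWSTATE` and `__EVENTVALIDATION`. The sample HTML, the `btnDownload` event target, and the helper names are hypothetical. The sketch is in Python rather than the project's C#.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""

def build_post_body(page_html, extra_fields):
    """Echo the page's hidden fields back, plus the fields that trigger the action."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    form = dict(parser.fields)
    form.update(extra_fields)
    return urlencode(form)  # percent-encodes '/', '+' and '=' in the view state

# Hypothetical page fragment; a real view state is far longer.
sample_page = '''
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTA4/abc+=" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz" />
  <input type="text" name="query" />
</form>
'''
body = build_post_body(sample_page, {"__EVENTTARGET": "btnDownload"})
```

The resulting string would be sent as the body of the POST request with the content type `application/x-www-form-urlencoded`.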

ETL (Extract, Transform, Load)


The ETL performs cleanup operations on the data from the harvester

If there are malformed tags or invalid characters, they are escaped here

Maintains an error log

Loads data into the database through the DAL (Data Access Layer)


[Diagram: Flat File (XML) -> ETL Tool -> DAL]
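A minimal sketch of the cleanup step, in Python rather than the project's C#. It assumes that "invalid characters" means control characters not permitted in XML 1.0 and that stray markup should be escaped. Dropping the control characters (rather than escaping them) is an assumption, and `clean_field` is a hypothetical name.

```python
import logging
import re
from xml.sax.saxutils import escape

log = logging.getLogger("etl")  # stands in for the ETL error log

# Control characters not allowed in XML 1.0 (tab, LF and CR are allowed).
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def clean_field(raw):
    """Drop disallowed control characters, trim whitespace, escape markup."""
    stripped = _INVALID_XML_CHARS.sub("", raw)
    if stripped != raw:
        log.warning("removed control characters from %r", raw)
    # escape() rewrites &, < and > as XML entities.
    return escape(stripped.strip())

cleaned = clean_field("  Des\x0c Moines <IA> & Co.\x00 ")
# cleaned == "Des Moines &lt;IA&gt; &amp; Co."
```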

ETL Difficulties


Implementing multi-threaded execution for better performance


Dealing with malformed input
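One way to picture multi-threaded execution is a worker pool that transforms records in parallel. This is a sketch under assumptions, not the team's design: `transform` is a hypothetical per-record cleanup, the pool size is arbitrary, and the code is Python rather than C#.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(record):
    """Hypothetical per-record cleanup: trim and title-case the city name."""
    return {**record, "city": record["city"].strip().title()}

def run_etl(records, max_workers=4):
    # pool.map preserves input order, so results line up with the input records.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transform, records))

loaded = run_etl([{"city": " ames "}, {"city": "DES MOINES"}])
```

In CPython, threads mainly help when the per-record work waits on I/O (files, the database); CPU-bound cleanup would need processes instead.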

DAL (Data Access Layer)


Maintains a normalized MySQL database

Provides CRUD operations (Create, Read, Update, Delete)
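The CRUD surface might look like the sketch below. It is an illustration only: SQLite's in-memory database stands in for MySQL so the example is self-contained, the `company` table and its columns are hypothetical, and the method names simply follow the Add/Find/Update/Delete labels on this slide. The project itself used C#.

```python
import sqlite3

class CompanyDal:
    """Tiny data access layer over a single hypothetical table."""

    def __init__(self, connection):
        self.conn = connection
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS company ("
            "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
        )

    def add(self, name, city):
        cur = self.conn.execute(
            "INSERT INTO company (name, city) VALUES (?, ?)", (name, city)
        )
        self.conn.commit()
        return cur.lastrowid  # id of the new row

    def find(self, company_id):
        return self.conn.execute(
            "SELECT id, name, city FROM company WHERE id = ?", (company_id,)
        ).fetchone()  # None when no such row exists

    def update(self, company_id, city):
        self.conn.execute(
            "UPDATE company SET city = ? WHERE id = ?", (city, company_id)
        )
        self.conn.commit()

    def delete(self, company_id):
        self.conn.execute("DELETE FROM company WHERE id = ?", (company_id,))
        self.conn.commit()

dal = CompanyDal(sqlite3.connect(":memory:"))
new_id = dal.add("Acme", "Ames")
dal.update(new_id, "Des Moines")
row = dal.find(new_id)
dal.delete(new_id)
```

Parameterized queries (`?` placeholders) keep scraped text from being interpreted as SQL.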




[Diagram labels: User Interface, ETL Tool, DAL with Add(), Find(), Update(), Delete(), Database]

DAL Difficulties

No particular difficulties encountered in database creation

Web Services


Expose the DAL for access from external web apps

Accessed by HTTP GET or POST requests

Return JSON objects containing data
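A read-style service can be sketched as a function that takes request parameters and returns a JSON body. Everything here is assumed for illustration: the `id` parameter, the status codes, the error payloads, and the in-memory `_RECORDS` dictionary that stands in for the DAL. The sketch is in Python rather than the project's C#.

```python
import json

# In-memory stand-in for the DAL; a real service would query the database layer.
_RECORDS = {1: {"id": 1, "name": "Acme", "city": "Ames"}}

def handle_read(params):
    """Return (status_code, json_body) for a read request."""
    try:
        record_id = int(params["id"])
    except (KeyError, ValueError):
        # Missing or non-numeric id: report the problem as JSON too.
        return 400, json.dumps({"error": "id must be an integer"})
    record = _RECORDS.get(record_id)
    if record is None:
        return 404, json.dumps({"error": "not found"})
    return 200, json.dumps(record)

status, body = handle_read({"id": "1"})
bad_status, bad_body = handle_read({"id": "abc"})
```

Keeping the handler independent of the HTTP server makes it easy to call with both expected and invalid input.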




[Diagram labels: Services: Read(), Progress(), Write(), Update(), Delete()]

Web Services Difficulties

Returning large JSON objects to the UI

GUI (Graphical User Interface)

GUI Difficulties


Implementing auto-complete functionality for query efficiency

Progress bar updates

Grid configuration and updating

Retrieving large amounts of data from web services
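Auto-complete can be served cheaply from a sorted list of names using binary search. The slides do not say how the team implemented it, so this is only one possible approach; the function name, the sample names, and the `limit` parameter are assumptions, and the code is Python rather than C#.

```python
from bisect import bisect_left

def autocomplete(sorted_names, prefix, limit=10):
    """Return up to `limit` names starting with `prefix` (case-sensitive)."""
    # All names sharing the prefix sit contiguously, starting at this index.
    start = bisect_left(sorted_names, prefix)
    matches = []
    for name in sorted_names[start:start + limit]:
        if not name.startswith(prefix):
            break
        matches.append(name)
    return matches

names = sorted(["Kingland", "Kingston", "Kinross", "Acme", "Keystone"])
suggestions = autocomplete(names, "King")
# suggestions == ["Kingland", "Kingston"]
```

Capping the result with `limit` also keeps the response small, which matters when the full data set is large.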

Overall Test Plan


Test each module individually to ensure independent functionality

As modules are completed, test integration pairs to ensure channel adequacy

When all modules are integrated, test the system end-to-end using the web app

Harvester / Parser Test Plan


Ensure the harvester can connect to the site for scraping and retrieve the appropriate data

Maintain a list of input files that produce specific output after parsing

Define corner cases for sub-function robustness evaluation / testing


Ensure errors are caught and handled appropriately
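The "known input, specific output" idea can be sketched as a table-driven test. The parser step `parse_city`, the sample inputs, and the blank-input corner case are all hypothetical; a real suite would read the inputs from the maintained list of files. The sketch is in Python rather than the project's C#.

```python
import re

def parse_city(raw):
    """Hypothetical parser step: collapse whitespace and normalize casing."""
    collapsed = re.sub(r"\s+", " ", raw).strip()
    if not collapsed:
        raise ValueError("empty city name")
    return collapsed.title()

# Each case pairs a known input with the output it must produce.
CASES = [
    ("DES   MOINES", "Des Moines"),
    ("\tames\n", "Ames"),
    ("Cedar Rapids", "Cedar Rapids"),
]

def run_cases():
    """Return a list of (input, actual) pairs for every failing case."""
    failures = []
    for raw, want in CASES:
        got = parse_city(raw)
        if got != want:
            failures.append((raw, got))
    # Corner case: blank input must raise rather than return garbage.
    try:
        parse_city("   ")
        failures.append(("   ", "no error raised"))
    except ValueError:
        pass
    return failures

failures = run_cases()
```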

ETL Test Plan


Maintain a list of input files that produce specific output after data cleanup

Ensure errors are caught and handled appropriately

Confirm the ETL can communicate with the DAL

DAL Test Plan


Ensure database can have records created, read, updated, and deleted

Define corner cases and error handling for invalid database operations


Create list of operations with expected results

Web Services Test Plan


Call each web service with expected input and check return values

Call web services with invalid input and check return values

Project Future


Database model can be generalized to include any number of data sources

Harvester can be separated from ETL so additional data sources will not require ETL changes

Optimization / multithreading of harvester and parser for greater efficiency

User access control features in web application

Two separate GUIs: one for Kingland clients, and one for Kingland data analysts

Questions?