PageRank Calculation using Sparse Matrix in Clustered Computer Environment

Kwan Dong Kim, Chang Min Kim
























Table of Contents

1. Introduction
2. PageRank
   A. Definition
   B. PageRank Algorithm
3. Sparse Matrix Representation
   A. Definition
   B. Compression Algorithm
4. Hardware Environment
   A. V2 Cluster
   B. Parallel Java
   C. Parallel Java in V2 Cluster
5. PageRank Iteration in Cluster
6. Experiments and Results
   A. Methodology
   B. Results and Conclusion
7. Acknowledgement
8. Reference
9. Appendix
   A. Establishing SSH Connection and Running MPI Program
   B. Shell scripts
   C. Java Codes



1. Introduction

The purpose of the Web Laboratory is to provide data and computing tools for research about the Web and the information on the Web. "It is funded in part by National Science Foundation grants CNS-0403340, DUE-0127308, SES-0537606, and IIS 0634677. The Web Lab is an NSF Next Generation Cyberinfrastructure project" [1]. To achieve this goal, three teams work on the Web Lab project: the Index, User Interface, and PageRank teams. The Index team provides full-text indexing for the other teams using a Linux cluster server; the User Interface team takes charge of the user interface and of pre-processing the data for PageRank; and the PageRank team works on compressed sparse matrices and parallel programming to calculate large-scale PageRank. This document is the project report of the PageRank team, so it explains several methods for calculating large-scale PageRank and then shows the results of the experiments that test the performance and scalability of this algorithm.


2. PageRank

A document on the web can typically be measured by two metrics, "relevance" and "importance". Relevance is based on a comparison between the document terms and the query terms, while importance is based on an estimate of the popularity of the document. Although search engines combine the two metrics into one in practice, importance by popularity is considered the key metric for ranking a document on the web. Computing the relevance metric requires every document in a set to be compared with the query terms, which is neither practically possible nor desirable for the web, due to the cost and to characteristics of the web such as its growing number of documents and their uneven quality; it is rather suited to controlled collections of documents. Among the several popularity measures of a page on the web, "PageRank" provides the most suitable and efficient semantics and algorithm.

A. Definition

Brin and Page (1998) suggested "PageRank" as a measure estimating the popularity of a web page [2]. PageRank calculates the stochastic probability of reaching a certain web page based on the number of in-links to the page. "PageRank is basically modified version of Pinski and Narin's influence weights applied to the web graph" (Arms, 2006). The PageRank calculation involves an iteration of matrix multiplications and additions, but it is proven to converge in a reasonable amount of time, and the number of iterations does not depend on the size of the input data, i.e. the number of pages being calculated. This is the major reason why PageRank can be applied to a large set of documents such as those on the web, unlike other algorithms such as "Hubs and Authorities" (Kleinberg, 1997) [3].


B. PageRank Algorithm

PageRank essentially calculates the stochastic probability of reaching a page based on the "Random Surfer" model. In this model, a user surfs around the web without any restriction except one assumption: the probability of reaching a page from another page is decided purely by the number of links on the page the surfer is currently on. From the viewpoint of the page being reached, a page with many in-links has a higher probability of being reached and thus gets a higher PageRank. What the random surfer model assumes is that a user may follow a link on the page he is on, or he may jump to another page at random. So he begins surfing with the same probability of reaching any page in the set of documents, virtually the whole web.

Suppose the number of all pages in the set is n. Then W_0 is a vector with every element 1/n, which is the probability of reaching each page in the vector.

Then set up a square matrix B that contains all the link information of the pages in the set. The column index represents the page a link is from and the row index represents the page a link is to. Each cell in the matrix contains 1 if there is a link between those two pages (from and to). Then normalize it by dividing each column by the total number of links in that column, so that each cell contains the probability of reaching the page in its row from the page in its column. A page which does not have any link to other pages is called "a dangling node", and the value for each cell in its column is 1/n, because a surfer who reaches such a page will jump to a random page, there being no out-link on that page.



W_0 = [1/4, 1/4, 1/4, 1/4]

B =

          from 1   from 2   from 3   from 4
  to 1      0       1/2      1/4      1/3
  to 2     1/2       0       1/4      1/3
  to 3      0        0       1/4      1/3
  to 4     1/2      1/2      1/4       0

* page 3 is a dangling page

The probabilities of reaching each page after one step from the beginning page, without a random jump to another page, can be calculated by W_1 = B * W_0'. If we consider a random jump from the starting page with probability 1-d (d is the damping factor), then the equations will be


W_1 = d * (B * W_0') + (1-d) * W_0'
W_2 = d * (B * W_1') + (1-d) * W_0'
W_3 = d * (B * W_2') + (1-d) * W_0'
.
.
.
W_k = d * (B * W_k-1') + (1-d) * W_0'



The sum of every element in each W is 1, because the sum is the probability of reaching any page in the set. Every element of W eventually converges, and W becomes the PageRank of the pages in the set. The convergence to a unique vector for any given starting vector W_0 follows from the Markov property, because matrix B is stochastic, irreducible, and aperiodic (Arms, 2006).
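As a concrete illustration of the iteration above (our own sketch, not the project's code; the helper name `rank` is ours), the following Java program runs the power iteration on the four-page example matrix B:

```java
import java.util.Arrays;

public class PageRankDense {
    // Power iteration: w_k = d * (B * w_{k-1}) + (1 - d) * w_0
    static double[] rank(double[][] B, double d, int iters) {
        int n = B.length;
        double[] w0 = new double[n];
        Arrays.fill(w0, 1.0 / n);          // uniform starting vector W_0
        double[] w = w0.clone();
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {  // next[i] = d * (B * w)[i] + (1 - d) * w0[i]
                double s = 0;
                for (int j = 0; j < n; j++) s += B[i][j] * w[j];
                next[i] = d * s + (1 - d) * w0[i];
            }
            w = next;
        }
        return w;
    }

    public static void main(String[] args) {
        // The four-page example; column 3 is the dangling page filled with 1/4
        double[][] B = {
            {0.0, 0.5, 0.25, 1.0 / 3},
            {0.5, 0.0, 0.25, 1.0 / 3},
            {0.0, 0.0, 0.25, 1.0 / 3},
            {0.5, 0.5, 0.25, 0.0}
        };
        System.out.println(Arrays.toString(rank(B, 0.85, 100)));
    }
}
```

Because B is column-stochastic, each iteration preserves the property that the elements of W sum to 1.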


B is a dense matrix, but in the real world a typical web page will not have more than several hundred links at most, and even that is unusual. In our experiment with the Amazon dataset, the average number of out-links in a page was , and the page with the most links had . Under this condition, iterating matrix multiplications on the dense matrix reduces the performance of the PageRank calculation significantly. We can construct a sparse matrix by re-writing

C = S + (1/n) * e * a'        --- (1)

Then C accounts for the random jump from dangling pages. S is the initial B matrix before adding 1/n for dangling pages, e is a vector with every element 1, and "a" is a vector whose elements are 1 if the corresponding pages are dangling and 0 otherwise. If we define a matrix L,

L = d * C + (1-d) * (1/n) * E        --- (2)

then L also accounts for the random jump from any page by user choice. E is a square matrix with every element 1. Then


W_k = L * W_k-1'        --- (3)


If we substitute L using equations 1 and 2:

W_k = (d * C + (1-d)(1/n) * E) * W_k-1'
    = d * S * W_k-1' + d * (1/n) * e * a' * W_k-1' + (1-d)(1/n) * E * W_k-1'
    = d * S * W_k-1' + d * (1/n) * e * (a' * W_k-1') + (1-d)(1/n) * e        --- (4)

(note that E * W_k-1' is e)


There is no dense matrix in this equation, so iteration with this equation is significantly more efficient as the size of the set gets larger. We used equation 4 for our experiments.
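A single-machine sketch of one step of equation 4 (our illustrative data layout, not the project's matrix class): S is stored row by row as (column, value) pairs, and the dangling vector a contributes only through the scalar a'·W_k-1.

```java
import java.util.Arrays;

public class SparsePageRank {
    // One step of equation 4:
    // W_k = d*S*W_{k-1}' + d*(1/n)*e*(a'*W_{k-1}') + (1-d)*(1/n)*e
    static double[] step(int[][] cols, double[][] vals, boolean[] dangling,
                         double[] w, double d) {
        int n = w.length;
        double aDotW = 0;                      // a' * W_{k-1}: total mass on dangling pages
        for (int j = 0; j < n; j++) if (dangling[j]) aDotW += w[j];
        double[] next = new double[n];
        double base = d * aDotW / n + (1 - d) / n;  // both rank-one terms are uniform
        Arrays.fill(next, base);
        for (int i = 0; i < n; i++)            // add d * S * W_{k-1}, stored cells only
            for (int t = 0; t < cols[i].length; t++)
                next[i] += d * vals[i][t] * w[cols[i][t]];
        return next;
    }

    public static void main(String[] args) {
        // The four-page example: S holds only the true links; page 3 (index 2) is dangling
        int[][] cols = { {1, 3}, {0, 3}, {3}, {0, 1} };
        double[][] vals = { {0.5, 1.0 / 3}, {0.5, 1.0 / 3}, {1.0 / 3}, {0.5, 0.5} };
        boolean[] dangling = {false, false, true, false};
        double[] w = {0.25, 0.25, 0.25, 0.25};
        for (int it = 0; it < 100; it++) w = step(cols, vals, dangling, w, 0.85);
        System.out.println(Arrays.toString(w));
    }
}
```

Since C = S + (1/n)·e·a' equals B for this example, the result matches the dense iteration while never materializing a dense matrix.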


However, as the PageRank team found, calculating PageRank on a very large set of pages raises a memory space problem. Constructing a huge sparse matrix is not desirable in terms of either memory utilization or computational efficiency. We can transform this sparse matrix into a compressed sparse matrix using the sparse matrix representation technique, which will be discussed in chapter 3.

But even with this representation, in-memory computation is impractical when the data set is too large, so we need to move to a clustered computing environment (considering the large number of web pages, this should be done rather than merely needs to be done). By using the "Message Passing Interface (MPI)", multiple nodes can exchange the relevant portions of data and compute on them concurrently. MPI communication is very expensive in terms of speed, but considering the size of real data, it is inevitable. The hardware environment and setup for clustered computing will be discussed in chapter 4.

In this clustered computing environment using MPI, we need to modify the algorithm slightly so that the necessary information for each node can be distributed and gathered among the nodes effectively and losslessly. The actual process of the algorithm and problems on clustered machines will be discussed in chapter 5.


3. Sparse Matrix Representation

A. Definition

A sparse matrix is a matrix that consists primarily of zeros. Sparse matrix calculation is important for computing PageRank because the PageRank calculation is fundamentally a combination of several sparse matrix calculations. Therefore, if we can deal with the matrix more efficiently, we can process a bigger PageRank calculation. Fig-1 is an example of a sparse matrix: it contains many "0"s and only a few numbers greater than "0".


index   1  2  3  4  5  6  7  8  9
  1     0  0  1  0  0  0  0  0  0
  2     0  0  0  0  0  0  0  0  0
  3     1  0  0  1  0  0  0  2  0
  4     0  0  0  0  0  0  0  0  0
  5     0  0  0  0  0  0  0  1  0
  6     .  .  .  .  .  .  .  .  .

(Fig-1) Example of sparse matrix


As we mentioned, the PageRank calculation is a combination of several sparse matrix calculations. Therefore, if there are 100 pages, we have to process a 100 x 100 matrix to calculate PageRank without compression. For the Web Lab project there are approximately 13 million pages, so at least a 13 million x 13 million matrix must be processed. If we use a regular array, vector, or linked list, enormous memory space is required to calculate the PageRank. For example, suppose there are one million pages and the average number of out-links per page is approximately 10. If we use a regular array structure and each cell in the array needs 4 bytes, then 4 bytes x one million x one million = 4 x 10^12 bytes is required to store the matrix. However, if we ignore all zero values and store only the values greater than zero, we need only 4 x 10^7 bytes (4 bytes x one million pages x 10 links). As a result, the compressed sparse matrix algorithm saves enormous memory space when we calculate the PageRank.



B. Compression Algorithm

Three linked lists are used to implement the sparse matrix algorithm. The first is the row linked-list, which represents the row index; the second is the column linked-list, which represents the column index; and the third is the value linked-list, which represents the value of each cell [4]. Fig-2 shows the structure of the compressed sparse matrix representation, and Fig-3 shows the equivalent regular array representation of the same matrix as the compressed representation in Fig-2.
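The three lists can be sketched with plain arrays instead of linked lists (a simplification of the structure in Fig-2; the class and field names are ours): rowCount[i] stores how many entries row i has, and the column and value lists hold those entries consecutively, row by row.

```java
public class CompressedMatrix {
    final int[] rowCount;   // number of stored entries in each row
    final int[] colList;    // 1-based column indices, row by row
    final double[] valList; // cell values, aligned with colList

    CompressedMatrix(int[] rowCount, int[] colList, double[] valList) {
        this.rowCount = rowCount;
        this.colList = colList;
        this.valList = valList;
    }

    // y = M * x, touching only the stored (non-zero) cells
    double[] multiply(double[] x) {
        double[] y = new double[rowCount.length];
        int p = 0;                                    // cursor into colList/valList
        for (int i = 0; i < rowCount.length; i++)
            for (int t = 0; t < rowCount[i]; t++, p++)
                y[i] += valList[p] * x[colList[p] - 1];
        return y;
    }

    public static void main(String[] args) {
        // The matrix of Fig-3, encoded with the three lists of Fig-2
        CompressedMatrix m = new CompressedMatrix(
            new int[]{5, 0, 4, 2, 1},
            new int[]{2, 3, 7, 8, 9,  1, 3, 4, 9,  2, 4,  8},
            new double[]{1, 2, 1, 1, 2,  1, 3, 1, 2,  3, 4,  5});
        double[] ones = new double[9];
        java.util.Arrays.fill(ones, 1.0);
        // Multiplying by the all-ones vector yields the row sums
        System.out.println(java.util.Arrays.toString(m.multiply(ones)));
    }
}
```

Only the non-zero cells are stored and visited, which is the source of the memory and time savings discussed above.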



Row List    :  5  0  4  2  1  ...   (number of stored entries in each row, starting with the first row)
Column List :  2  3  7  8  9  1  3  4  9  2  4  8  ...
Value List  :  1  2  1  1  2  1  3  1  2  3  4  5  ...

(Fig-2) Compressed sparse matrix representation


index   1  2  3  4  5  6  7  8  9
  1     0  1  2  0  0  0  1  1  2
  2     0  0  0  0  0  0  0  0  0
  3     1  0  3  1  0  0  0  0  2
  4     0  3  0  4  0  0  0  0  0
  5     0  0  0  0  0  0  0  5  0
  6     .  .  .  .  .  .  .  .  .

(Fig-3) Regular Matrix Representation


4. Hardware Environment

A. V2 Cluster

Even with the compressed sparse matrix representation, the data is still too large to process on a single machine, so we decided to process it on multiple-node Linux cluster servers. The Cornell Theory Center has several hundred nodes of Linux cluster servers, so we could use this system. To calculate PageRank using the cluster servers, we developed a program that runs on these multiple Linux cluster servers. It divides the whole job and processes each part on a separate node simultaneously; each node produces its result when its part is finished. All of these results are returned to



the Linux login machine when all of the processes are finished, and the results are merged together on the Linux login machine.


B. Parallel Java

As we mentioned, the Linux cluster servers at the Cornell Theory Center will be used to run our PageRank program. According to the Cornell Theory Center's manual for parallel programming, only the C and Fortran languages are supported on this system. However, our goal is to develop a Java PageRank program that runs on the Linux cluster servers at the Cornell Theory Center, so we looked for Java packages which support parallel programming.

The parallel programming Java packages are listed below:

1. mpiJava (http://aspen.ucs.indiana.edu/pss/HPJava/mpiJava.html)
2. Open MPI (http://www.open-mpi.org/)
3. OpenMP API (http://docs.sun.com/app/docs/doc/819-3694)
4. MPICH2 (http://www-unix.mcs.anl.gov/mpi/mpich/)
5. Parallel Java (http://www.cs.rit.edu/~ark/pj.shtml)

Although there are many Java packages supporting parallel programming, we selected Parallel Java (PJ), implemented by Professor Alan Kaminsky at the Rochester Institute of Technology, because this package is well documented and there are several examples that are easy to follow.


There are several important methods and programs in the Parallel Java package: the "Comm" and "Buffers" methods and the "JobScheduler", "Backend", and "Frontend" programs. The "Comm" and "Buffers" methods are used to communicate between the backend and the frontend, and the "JobScheduler" assigns the jobs to each node (frontend or backend). The frontend node controls all of the backend nodes, gathers all results from the backend nodes, and merges them together. The backend nodes process the actual data and send the results back to the frontend node. The "JobScheduler", "Backend", and "Frontend" programs are required when we run our parallel program (refer to the Parallel Java docs) [5].


C. Parallel Java in V2 Cluster

To set up the appropriate environment for the parallel program, it is essential to understand the structure of the Linux cluster servers at the Cornell Theory Center. There are many systems at the Cornell Theory Center; we use only three: the Linux login machine, the file server, and the V2 Linux cluster servers. The overall structure of the system is shown in Fig-4. The system only allows a client to connect to the Linux login machine at first; the user can then connect to the V2 Linux cluster servers by using the SSH command once logged in to the Linux login machine. The user directories and files are stored and managed on the file server. When a client connects to the Linux login machine, the user directory from the file server is mounted automatically, and the same happens when the client connects to the V2 Linux cluster servers. Therefore, the user sees the same file system throughout the system, including the Linux login machine and the V2 Linux cluster servers [6].



(Fig-4) The structure of the Linux cluster servers: a client connects to the Linux login machine, which connects to the V2 Linux cluster nodes (vii0001, vii0002, ..., vii000k); the file server is mounted on all machines


As we mentioned, the Linux cluster servers at the Cornell Theory Center only support the C and Fortran languages, so a special setup is needed for the Parallel Java (PJ) package. There are five essential conditions for running the Parallel Java package on the Linux cluster servers:

1. The Java package must be installed on each node (V2 Linux cluster servers)
2. The nodes must be able to communicate over SSH without any authentication prompt
3. The packages must be deployed on each node


4. The Job Scheduler should be running on each node
5. The program should be developed as explained in the PJ document

To satisfy the first condition, we asked the administrators at the Cornell Theory Center to install Java 1.5 on the Linux cluster servers. We then set up a public key for our account to allow SSH communication without authentication. Next, we wrote script programs to copy all required programs to each node and to generate the configuration file for the job scheduler; finally, these scripts run the Job Scheduler and our program. The detailed shell code for the setup and the procedure for setting up the public key are supplied in the appendix.


5. PageRank Iteration in Cluster

Beyond the hardware setup for the clustered machines, we need to modify the PageRank process slightly. It is important to understand the relationship between the matrix and its compressed representation, between the PageRank iteration and the compressed representation, and the characteristics of the input data, as well as the hardware environment and some facts about the "Parallel Java" package.

First, when we compress a sparse matrix into the compressed format, we intentionally eliminate the cells with value 0 to get rid of unnecessary information, at the cost of losing some information. In our implementation, we kept the row information in the data structure even for rows with no entries, as long as there was an attempt to input any data into the compressed format, even a 0. This is good enough for the PageRank calculation because the link matrix in a PageRank calculation is always square, so even without full column information we can still derive the column dimension from the row dimension. However, if the compressed representation is used in other calculations, it should be modified to keep the column dimension information.

Second, in the clustered PageRank calculation, each node needs to maintain just the portion of the input file it should calculate. However, each node has to have the full W vector to compute its part of the iteration, and e, a', and ea' should be sliced so that the appropriate portion of each joins the calculation. Also, the resulting W vector on each node is only a part of the full W vector, so the parts must be gathered properly by the master node and re-distributed to each node for the next iteration.
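The slicing can be illustrated in a single process (hypothetical names, not the project's PJ code; PJ's actual gather and broadcast calls are replaced here by array copies): each "node" computes only its block of rows of the next W, and the master concatenates the blocks back into the full vector.

```java
import java.util.Arrays;

public class SlicedIteration {
    // Each "node" owns rows [lo, hi) of B and produces that slice of the next W.
    static double[] slice(double[][] B, double[] w, double[] w0, double d, int lo, int hi) {
        double[] part = new double[hi - lo];
        for (int i = lo; i < hi; i++) {
            double s = 0;
            for (int j = 0; j < B[i].length; j++) s += B[i][j] * w[j];
            part[i - lo] = d * s + (1 - d) * w0[i];
        }
        return part;
    }

    public static void main(String[] args) {
        double[][] B = {
            {0.0, 0.5, 0.25, 1.0 / 3},
            {0.5, 0.0, 0.25, 1.0 / 3},
            {0.0, 0.0, 0.25, 1.0 / 3},
            {0.5, 0.5, 0.25, 0.0}
        };
        int n = B.length, nodes = 2;
        double[] w0 = new double[n];
        Arrays.fill(w0, 1.0 / n);
        double[] w = w0.clone();
        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            for (int k = 0; k < nodes; k++) {       // in MPI, each k runs on its own node
                int lo = k * n / nodes, hi = (k + 1) * n / nodes;
                double[] part = slice(B, w, w0, 0.85, lo, hi);
                System.arraycopy(part, 0, next, lo, part.length); // "gather" at the master
            }
            w = next;                                // then re-"broadcast" for the next iteration
        }
        System.out.println(Arrays.toString(w));
    }
}
```

Note that every node still needs the full current W as input, which is exactly why the gathered vector must be re-distributed before each iteration.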

Third, there are several issues in the "Parallel Java" package that we do not fully understand. For example, when we set up buffers for sending a Boolean variable and receiving a Boolean array, we encountered a null pointer exception. The code was

BooleanItemBuf bif = new BooleanItemBuf();
BooleanItemBuf[] bifa = new BooleanItemBuf[size];
bif.set(true);
world.gather(0, bif, bifa);

The problem was resolved after we re-ordered the lines as follows:

BooleanItemBuf bif = new BooleanItemBuf();
bif.set(true);
BooleanItemBuf[] bifa = new BooleanItemBuf[size];
world.gather(0, bif, bifa);


We still cannot find the justification for this, but in practice the order of the statements mattered.

We also suffered for a long time from the jobScheduler not running correctly. The reason was that the jobScheduler takes some time to start up fully before it works correctly, so we modified the shell script to wait for some time after it issues the java command for the jobScheduler.


Moreover, there are quite serious environmental problems with the clustered machines at the Cornell Theory Center. First, it is quite hard to get a quota for a clustered machine, especially a development machine. We are supposed to use the v2 linuxdev cluster for development purposes, which has only 4 nodes, and it is hardly ever idle. Second, and worst, this cluster is quite unstable and the servers go down quite often. For example, exactly the same set of code works for one account but not for another account on the same cluster, and there were many times when clearly working code did not work until Theory Center agents rebooted the system. We spoke with the agents when we believed we were right, and rebooting the system usually resolved the problem. The cluster uses "mpiexec" to distribute the necessary files from the login machine to the nodes, and it failed often for no evident reason. Executing the same program does not guarantee the same process under clustered conditions on the v2 linux and v2 linuxdev clusters; the same code behaves differently depending on the server. All of these were major obstacles that slowed the development stage.



Because we spent most of our time verifying the PageRank iteration equation and setting up the hardware environment, we had little chance to work on the PageRank calculation program for clustered machines. We have now reached the point where message passing for all data types except the object data type is working correctly, and most parts of the PageRank calculation program are working correctly.

One of the remaining parts that should be done for PageRank in a clustered environment is, first, to figure out how to exchange the object data type. After the partial wCurrent, which is the result matrix of each iteration on a node, is calculated, all the partial matrices should be gathered and broadcast to each node again for the next iteration. wCurrent is a compressed matrix, and this data should be exchanged using an Object (Array, Item) Buffer, but we did not have a chance to work on this due to the time restriction. Instead, we decomposed the compressed matrix into its three component arrays and exchanged those in place of the real matrix. Although this is not the way we ultimately want, and this method breaks the encapsulation of the compressed matrix, we can now calculate the PageRank iteration by exchanging those arrays. We still need to verify the program with more data: because we could not get enough data in time, we only tested with very small data samples, some real and some randomly generated, ranging from 5 by 5 to 1000 by 1000. Although we got quite reasonable results from the current code, it might not work for larger data sets or under specific conditions. As always, there are also several points that should be optimized; one good example is transpose() in CCompressdMatrix, used for transforming between row-based and column-based matrix representations.

Another issue is to figure out the problem shown in the code above. Based on the characteristics of an object-oriented language, it is hard to expect such behavior, but it is definitely what we observed, at least under the Cornell Theory Center (CTC) environment, according to our experiments. It is possible that the Parallel Java (PJ) package is not well tuned for the CTC environment, because it is open to the public for academic purposes and not designed specifically for CTC. We may try another package with the same code.


Another issue is that we could not implement the function to write the results to our server. To implement this function, we could write a result file on each cluster node, collect all of the results on the login machine, and then merge them; however, this function is not developed right now because we spent our time testing and experimenting with our algorithms.

Lastly, the most important thing, and probably the first that should be done, is to establish a stable clustered computing environment. For the reasons described above, it is really hard to work on the program. Besides the pure programming difficulties of a clustered environment, when there was an error or bug, it was hard to tell whether the cause was a bad implementation or an unstable cluster.

6. Experiments and Results

A. Methodology

Our compressed matrix structure and parallel program were tested to measure performance. Performance is measured by comparing the time to finish each process as we increase the data. To measure the performance of our parallel program, we used 4 nodes, and data sets of 50, 100, 150, and 200 links were used for testing (the 50-link data set contains a sparse matrix structure of 50 web links).
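The measurement itself can be sketched as a simple timing loop over increasing input sizes (a synthetic, single-machine stand-in for our test harness; the matrices here are random, not our link data, and the names are ours):

```java
import java.util.Arrays;
import java.util.Random;

public class TimingSketch {
    // Time 1000 PageRank iterations on a synthetic n-by-n column-stochastic matrix.
    static double timeFor(int n) {
        Random rnd = new Random(42);
        double[][] B = new double[n][n];
        for (int j = 0; j < n; j++) {
            double colSum = 0;
            for (int i = 0; i < n; i++) { B[i][j] = rnd.nextDouble(); colSum += B[i][j]; }
            for (int i = 0; i < n; i++) B[i][j] /= colSum;   // normalize each column
        }
        double[] w = new double[n];
        Arrays.fill(w, 1.0 / n);
        long t0 = System.nanoTime();
        for (int it = 0; it < 1000; it++) {                  // fixed iteration count per size
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) next[i] += 0.85 * B[i][j] * w[j];
                next[i] += 0.15 / n;
            }
            w = next;
        }
        return (System.nanoTime() - t0) / 1e6;               // elapsed milliseconds
    }

    public static void main(String[] args) {
        for (int n : new int[]{50, 100, 150, 200})
            System.out.printf("n=%d: %.1f ms%n", n, timeFor(n));
    }
}
```

In the clustered experiments the elapsed time additionally includes the frontend/backend communication, which is what dominates as the data grows.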


B. Results and Conclusion

According to our experimental results, the performance of our parallel program degrades rapidly as the data increases. As the data grows, the data communication between the backend and the frontend increases rapidly; therefore, when we increase the data, the overall performance suffers considerably because of the excess communication. As a result, the performance graph is approximately an exponential function rather than a linear one. Fig-5 shows the result of the performance measurement.

(Fig-5) The result of the performance measurement: running time in seconds versus the number of web links (50 to 200)


7. Acknowledgement

We would like to thank the following people:

William Y. Arms, our project advisor
Daniel Ira Sverdlik, consultant at the Cornell Theory Center
Alan Kaminsky, author of the Parallel Java package

The Web Lab team wishes to thank the Internet Archive for their assistance and support. This work is funded in part by National Science Foundation grants CNS-0403340, SES-0537606, and IIS-0634677.


8. Reference

[1] W. Arms, The Web Laboratory: A Joint Project of Cornell University and the Internet Archive, http://www.infosci.cornell.edu/SIN/WebLab/about.html
[2] S. Brin & L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Stanford, USA, 1998
[3] J. Kleinberg, Authoritative Sources in a Hyperlinked Environment, Journal of the ACM 46, 1999; IBM Research Report, 1997
[4] N. Goharian, T. El-Ghazawi & D. Grossman, Enterprise Text Processing: A Sparse Matrix Approach, IEEE, 2001
[5] A. Kaminsky, Parallel Java Library, http://www.cs.rit.edu/~ark/pj.shtml, 2007
[6] Cornell Theory Center, Computing Resources for CTC Users, http://www.tc.cornell.edu/Services/CTC+Resources.htm, 2007
[7] J. Willcock & A. Lumsdaine, Accelerating Sparse Matrix Computations via Data Compression, ACM Press, 2006


9. Appendix

A. Establishing SSH Connection and Running MPI Program

A.1 Creating SSH keys and a required MPI file

Run the following script before submitting a batch job. From a Linux login node, type:

    /ctc/tools/setup_ssh_mpd_linux.sh

This script creates SSH keys and a required MPI file. MPICH2 is the supported MPI implementation on the Linux clusters.

The script /ctc/tools/setup_ssh_mpd_linux.sh performs the steps detailed below. Note: you do not need to issue any of the commands illustrated here; they are done for you automatically when you run the script.


On linuxlogin1.tc.cornell.edu or linuxlogin2.tc.cornell.edu, creates the directory ~<your_userid>/.ssh:

    cd ~<your_userid>
    mkdir .ssh
    cd .ssh

Creates an SSH keypair to automate logons:

    ssh-keygen -b 1024 -t dsa -C <your_userid>

Adds the SSH public key to the authorized_keys file (the file is visible from all machines):

    cat id_dsa.pub >> authorized_keys

Note: Use append (>>) when adding keys to the authorized_keys file so any existing keys are not overwritten.

Adds a required public key to your authorized_keys file. This is required to allow the scheduler to launch jobs with your userid. In addition to adding the key, it is also necessary to set the proper permissions on both the .ssh folder and the authorized_keys file for SSH to function:

    cat /ctc/tools/velocity.pub >> ~<your_userid>/.ssh/authorized_keys

Changes permissions on authorized_keys to 600 and on the .ssh directory to 700, and returns to your home directory:

    chmod 600 authorized_keys
    cd ..
    chmod 700 .ssh

Creates the file .mpd.conf in your home folder. It will contain the parameter MPD_SECRETWORD. Sets permissions so only you can read it:

    chmod 600 .mpd.conf

Running the script creates the following files:

    $HOME/.ssh/authorized_keys
    $HOME/.ssh/id_dsa
    $HOME/.ssh/id_dsa.pub
    $HOME/.mpd.conf


A.2 Testing MPI Interactively

Create a hosts file, mpd.hosts. On compute nodes that have been assigned by vsched, this is very easy to do:

    vsched -m
    mv machines mpd.hosts

Alternatively, use a text editor like nano, vi, or emacs to add any machine names you want to mpd.hosts, one name per line, and save it.

Start the mpd daemons. The preferred method uses mpdboot:

    mpdboot -n <numberofhosts>

Alternate method if mpdboot doesn't work: at the command prompt, enter mpd &. To find the port, run:

    mpdtrace -l

It will return the port number it is running on. To start mpd's on the other machines, run:

    ssh <nextmachinename> mpd -h <firstmachinename> -p <port> -d

Verify that all the mpd daemons are running correctly:

Run mpdtrace to get a quick trace and see all the machines.
Run mpdringtest 3000 to run a ring around the mpd daemons.
Verify you've got the right hosts with mpiexec -n <numberofhosts> hostname.

A.3 Running an MPI Program Interactively

Make sure mpd.hosts is in the same directory as your executable. Then, in that directory, issue:

    mpiexec -n <numberofhosts> <mycodename>

When you are done, close all the daemons by running:

    mpdallexit


B. Shell scripts

B.1 PageRank.xml

<?xml version="1.0" ?>
<!-- Sample XML Job File -->
<job>
  <nodes>4</nodes>
  <minutes>20</minutes>
  <type>interactive</type>
  <affiliation>v2linuxdev</affiliation>
  <run>/bin/sh $HOME/Lab/parallel.sh</run>
</job>


B.2 parallel.sh

#!/bin/sh
# parallel.sh
# @author Kwan Dong Kim

# Description of the script
# This script sets up the environment for Parallel Java in the V2 Linux Cluster

# Set up the number of machines
NMACHINES=4
# Set up the number of processes
NPROCS=4
# Set ROOTDIR
ROOTDIR=$HOME/Lab
export ROOTDIR
# Set and create an output directory
tmphost=`hostname | cut -f1 -d"."`
OUTDIR=$ROOTDIR/output
mkdir -v $OUTDIR
export OUTDIR
# Change directory from "linuxlogin" to "vii000XX" node
cd /tmp
# Set up the SSH public key authentication
vsched -m
mpdboot -n $NMACHINES -f /tmp/machines
# Create a local directory on /tmp
# Copy files to local disk (vii000XX)
mpiexec -n $NPROCS $ROOTDIR/setup.sh
TMPDIR=/tmp/$USER
export TMPDIR
mv /tmp/machines $TMPDIR/machines
# Run the executable from local disk (vii000XX)
cd $TMPDIR
# Copy "config_generator.sh" file
echo "copying $ROOTDIR/config_generator.sh $TMPDIR/config_generator.sh"
cp $ROOTDIR/config_generator.sh $TMPDIR/config_generator.sh
cp $ROOTDIR/colex $TMPDIR/colex
cp $ROOTDIR/linex $TMPDIR/linex
# Change directory from "linuxlogin" to "vii000XX" node
cd /tmp/$USER
echo "Host name"
echo `hostname`
# Generate config file
./config_generator.sh $NMACHINES
# Export PJ package class path
export CLASSPATH=.:/tmp/kk386/pj.jar
#:/home/nfs/ctcfsrv11/m/$USER/Lab/pj.jar
# Run the JobScheduler
echo "/usr/java/jre1.5.0_11/bin/java edu/rit/pj/cluster/JobScheduler scheduler.conf"
/usr/java/jre1.5.0_11/bin/java edu/rit/pj/cluster/JobScheduler scheduler.conf &
# Wait 10 sec for edu/rit/pj/cluster/JobScheduler to start correctly
echo "Start 10-second sleep"
sleep 10
# Export PJ package class path
export CLASSPATH=.:/tmp/kk386/pj.jar
# Run the CMainClusterProcess
echo "/usr/java/jre1.5.0_11/bin/java -Dpj.np=3 CMainClusterProcess pidList >& result.$tmphost.out"
/usr/java/jre1.5.0_11/bin/java -Dpj.np=3 CMainClusterProcess pidList >& result.$tmphost.out
# Copy output files to your output directory on the fileserver
# Delete all remaining files on /tmp/$USER
mpiexec -n $NPROCS $ROOTDIR/cleanup.sh
# Cancel all processes and nodes when all of the processes are finished.
vsched -c

B.3 setup.sh

#!/bin/sh
# setup.sh
# @author Kwan Dong Kim

# Description of the script:
# This script copies the required files to each node (both frontend and backend).

# Remove all data created previously
rm -f -r /tmp/$USER

# Set up the TMP directory
TMPDIR=/tmp/$USER

# Make the TMP directory
mkdir $TMPDIR
echo $TMPDIR
export TMPDIR

# Make the root directory
mkdir $ROOTDIR $TMPDIR

echo "cp -r $ROOTDIR/ $TMPDIR/"

# Copy all needed data and programs to each cluster node (vii000XX)
cp $ROOTDIR/* $TMPDIR/
cp -r $ROOTDIR/edu $TMPDIR/
cp -r $ROOTDIR/pr $TMPDIR/

B.4 config_generator.sh

#!/bin/sh
# config_generator.sh
# @author Kwan Dong Kim

# Description of the script:
# This script generates the configuration file for the job scheduler.

# save variable java_path
JAVA_PATH="/usr/java/jre1.5.0_11/bin/java"

# save variable PJ_package_path
PJ_PATH="/tmp/$USER/pj.jar"

# save variable log file
LOG_PATH="/tmp/$USER/scheduler.log"

# save variable web host
WEB_HOST_PATH=".tc.cornell.edu"

# save the number of cluster nodes to use
Num_Cluster=$1

# Set up the index variable i
i=1

# Get the frontend node by using the vsched, colex and linex commands
Frontend=`vsched -u v2Linuxdev | grep $USER | ./colex 1 | ./linex 1`

# Remove empty space in the variable Frontend
parsedFrontend=`echo "$Frontend" | tr -c '\012[a-zA-Z][0-9].\-_' '\n' | uniq`

echo -e "$parsedFrontend"

# Increase index i
i=`expr $i + 1`

# Save variable web host
WEB_HOST_PATH="$parsedFrontend.tc.cornell.edu"

# Start writing the config file
echo "#Parallel Java Job Scheduler Configuration file" > scheduler.conf
echo "#Frontend processor : $Frontend" >>scheduler.conf
echo "cluster v2linuxdev" >>scheduler.conf
echo "logfile $LOG_PATH" >>scheduler.conf
echo "webhost $WEB_HOST_PATH" >>scheduler.conf
echo "webport 8080" >>scheduler.conf
echo "schedulerhost localhost" >>scheduler.conf
echo "schedulerport 20617" >>scheduler.conf
echo "frontendhost $WEB_HOST_PATH">>scheduler.conf

# Start processing the backend nodes
while [ $i -le $Num_Cluster ]
do
    # Get the backend node by using the vsched, colex and linex commands
    Backend=`vsched -u v2Linuxdev | grep $USER | ./colex 1 | ./linex $i`

    # Remove empty space in the variable Backend
    parsedBackend=`echo "$Backend" | tr -c '\012[a-zA-Z][0-9].\-_' '\n' | uniq`

    # Write the backend node into the config file
    echo "backend $parsedBackend $parsedBackend $JAVA_PATH $PJ_PATH">>scheduler.conf

    # Increase index i
    i=`expr $i + 1`
done

# Copy the scheduler.conf file to the Linux login machine
cp $TMPDIR/scheduler.conf $ROOTDIR/output/scheduler.conf
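For illustration, a scheduler.conf produced by this script would look roughly like the following; the vii000XX host names are hypothetical placeholders, since the real values come from vsched at run time:

```
#Parallel Java Job Scheduler Configuration file
#Frontend processor : vii00001
cluster v2linuxdev
logfile /tmp/kk386/scheduler.log
webhost vii00001.tc.cornell.edu
webport 8080
schedulerhost localhost
schedulerport 20617
frontendhost vii00001.tc.cornell.edu
backend vii00002 vii00002 /usr/java/jre1.5.0_11/bin/java /tmp/kk386/pj.jar
backend vii00003 vii00003 /usr/java/jre1.5.0_11/bin/java /tmp/kk386/pj.jar
```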

B.5 cleanup.sh

#!/bin/sh
# cleanup.sh
# @author Kwan Dong Kim

# Description of the script:
# This script copies the final results, cleans up the nodes and finishes all processes.

# Copy all log and result files toward the Linux login machine
cd $TMPDIR/
cp $TMPDIR/result.* $OUTDIR
cp $TMPDIR/*.log $OUTDIR

# Remove all data and programs used on each node (vii000XX)
rm -f -r /tmp/$USER

C. Java Codes

C.1 CCompressedMatrix.java

/**
 * CCompressedMatrix.java
 * Created in 2007. 02. 04
 * @author Chang Min Kim
 * netID : ck273
 */

/**
 * Description of the Class
 * This class is a representation of a compressed sparse matrix.
 * A CSM (compressed sparse matrix) consists of three linked lists:
 * one for the row representation, a second for the column representation,
 * and the last for the values of the matrix elements.
 * For more information, refer to the documentation.
 */


package pr;

import java.util.*;
import Jama.Matrix;

public class CCompressedMatrix {

    // LinkedList containing the beginning cell number of colList for each row
    private LinkedList<Integer> rowList;
    // LinkedList containing the column number of the corresponding row
    private LinkedList<Integer> colList;
    // LinkedList containing the value of the corresponding cell
    private LinkedList<Double> valueList;

    // These four variables keep the information of the original matrix even if there is none
    // last column index that actually contains a value other than 0
    private int lastCol;
    // last row index that actually contains a value other than 0
    private int lastRow;
    // actual row dimension of this matrix including the rows with only 0's
    private int rowDim;
    // actual column dimension of this matrix including the columns with only 0's
    private int colDim;

    public CCompressedMatrix() {
        // initialize object
        rowList = new LinkedList<Integer>();
        colList = new LinkedList<Integer>();
        valueList = new LinkedList<Double>();
        lastCol = -1;
        lastRow = -1;
        rowDim = 0;
        colDim = 0;
    }



    public boolean compareWithMatrix(Matrix mt) {
        // compare this object with a Matrix object
        // If dimensions don't match, they are different
        if ( this.rowDim != mt.getRowDimension() || this.colDim != mt.getColumnDimension() ) {
            return false;
        }
        // cell by cell comparison
        for ( int i=0; i < mt.getRowDimension(); i++ ) {
            for ( int j=0; j < mt.getColumnDimension(); j++ ) {
                if ( mt.get(i, j) != getValueAt(i,j) ) {
                    System.out.println(i + " " + j);
                    return false;
                }
            }
        }
        return true;
    }



    public int getRowDimension() {
        return rowDim;
    }

    public void setRowDimension(int i) {
        rowDim = i;
    }

    public int getColDimension() {
        return colDim;
    }

    public void setColDimension(int i) {
        colDim = i;
    }

    public int getLastCol() {
        return lastCol;
    }

    public void setLastCol(int i) {
        lastCol = i;
    }

    public int getLastRow() {
        return lastRow;
    }

    public void setLastRow(int i) {
        lastRow = i;
    }

    public void setDims(int i, int j) {
        this.rowDim = i;
        this.colDim = j;
    }

    public void clear() {
        // initialize this object
        rowDim = 0;
        colDim = 0;
        lastRow = -1;
        lastCol = -1;
        rowList.clear();
        colList.clear();
        valueList.clear();
    }

    public boolean empty() {
        // tells if this compressedMatrix object has any element
        if ( rowDim == 0 || lastRow == -1 || colDim == 0 || lastCol == -1 )
            return true;
        else
            return false;
    }



    public boolean isEqualTo(CCompressedMatrix cm) {
        // compares two compressedMatrix objects
        // If dimensions don't match, they are different
        if ( this.rowDim != cm.getRowDimension() || this.colDim != cm.getColDimension() ) {
            return false;
        }
        // cell by cell comparison
        for ( int i=0; i < rowDim; i++ ) {
            for ( int j=0; j < colDim; j++ ) {
                if ( getValueAt(i,j) != cm.getValueAt(i, j) ) {
                    return false;
                }
            }
        }
        return true;
    }

    public LinkedList<Integer> getRowList() {
        // returns rowList
        return rowList;
    }

    public LinkedList<Integer> getColList() {
        // returns colList
        return colList;
    }

    public LinkedList<Double> getValueList() {
        // returns valueList
        return valueList;
    }



    // add cell elements to the compressed matrix
    // Because CCompressedMatrix is a row based compression,
    // every column for a row should be processed before the row increases.
    // For data ordered by column first, simply add the elements first and transpose this.
    // For more information about the representation, refer to the documentation.
    public void addElement(int i, int j, double k) {
        // check if there is an already existing value in that cell
        if ( getValueAt(i,j) == 0 ) {
            // increase dimension
            // This is necessary to keep the information of the original matrix, or the
            // number of pages, as much as possible in case the last rows don't have any link to them.
            if ( rowDim <= i ) {
                rowDim = i+1;
            }
            if ( colDim <= j ) {
                colDim = j+1;
            }
            // if the value is not 0, the compressed matrix will contain the value information for the cell
            if ( k != 0.0 ) {
                // beginning of a new row
                if ( lastRow < i ) {
                    // If there were rows with no values before this row i,
                    // the rows from lastRow+1 to i-1 should be filled with -1
                    // to indicate there were empty rows
                    for ( int rowLoc = lastRow+1; rowLoc < i; rowLoc++ ) {
                        rowList.add(new Integer(-1));
                    }
                    // now the last row with any values is i
                    lastRow = i;
                    // the column list size is the beginning index in colList for this row with a non zero value
                    rowList.add(new Integer(colList.size()));
                    // the column list contains the column index of the input cell
                    colList.add(new Integer(j));
                    // the value list contains the value for the given cell
                    valueList.add(new Double(k));
                    // lastCol indicates the last column index whose corresponding cell has a non zero value
                    if ( lastCol < j ) {
                        lastCol = j;
                    }
                }
                // in case a cell with the same row number as the given cell was already inserted into the matrix
                else if ( lastRow == i ) {
                    // We don't need to modify rowList, but simply add the column and value of the cell to the lists.
                    colList.add(new Integer(j));
                    valueList.add(new Double(k));
                    if ( lastCol < j ) {
                        lastCol = j;
                    }
                } else {
                    System.out.println("You are trying to add an element which should be inserted earlier");
                    System.out.println("Check if you are trying to insert elements in sorted order as specified");
                }
            }
        } else {
            System.out.println("There is already an existing value in that cell, try 'setValueAt(i,j,k)'");
        }
    }



    public CCompressedMatrix multiply(CCompressedMatrix cm) {
        // inner dimensions should match to perform matrix multiplication
        if ( this.colDim != cm.getRowDimension() ) {
            System.out.println("Dimension mis-match for multiplication");
            return null;
        } else {
            CCompressedMatrix resCM = new CCompressedMatrix();
            for ( int i = 0; i <= lastRow; i++ ) {
                // proceed only to the last column index of this matrix and the
                // last row index of cm for efficiency
                for ( int j = 0; j <= cm.getLastCol(); j++ ) {
                    double accumulator = 0.0;
                    for ( int k = 0; k <= lastCol; k++ ) {
                        accumulator += getValueAt(i,k) * cm.getValueAt(k,j);
                    }
                    resCM.addElement(i, j, accumulator);
                }
            }
            resCM.setColDimension(cm.getColDimension());
            resCM.setRowDimension(this.rowDim);
            return resCM;
        }
    }

    // transpose this matrix
    // after(i,j) = before(j,i)
    public CCompressedMatrix transpose() {
        CCompressedMatrix resCM = new CCompressedMatrix();
        for ( int i=0; i <= lastCol; i++ ) {
            for ( int j=0; j <= lastRow; j++ ) {
                resCM.addElement(i, j, getValueAt(j,i));
            }
        }
        resCM.setLastCol(this.lastRow);
        resCM.setLastRow(this.lastCol);
        resCM.setColDimension(this.rowDim);
        resCM.setRowDimension(this.colDim);
        return resCM;
    }

    // multiplies the values by a scalar
    public CCompressedMatrix ScalarMultiply(double x) {
        CCompressedMatrix resCM = new CCompressedMatrix();
        // multiply the values by the scalar
        for ( int i = 0; i < rowList.size(); i++ ) {
            resCM.getRowList().add(new Integer(rowList.get(i).intValue()));
        }
        for ( int i = 0; i < valueList.size(); i++ ) {
            resCM.getColList().add(new Integer(colList.get(i).intValue()));
            resCM.getValueList().add(new Double(valueList.get(i).doubleValue() * x));
        }
        resCM.setLastCol(this.lastCol);
        resCM.setLastRow(this.lastRow);
        resCM.setColDimension(this.colDim);
        resCM.setRowDimension(this.rowDim);
        return resCM;
    }



    public CCompressedMatrix plus(CCompressedMatrix cm) {
        // dimension check
        if ( this.colDim == cm.getColDimension() && this.rowDim == cm.getRowDimension() ) {
            // proceed to the index of the matrix which has the larger last index for efficiency
            int lc, lr;
            if ( lastCol < cm.getLastCol() ) {
                lc = cm.getLastCol();
            } else {
                lc = lastCol;
            }
            if ( lastRow < cm.getLastRow() ) {
                lr = cm.getLastRow();
            } else {
                lr = lastRow;
            }
            CCompressedMatrix resCM = new CCompressedMatrix();
            for ( int i = 0; i <= lr; i++ ) {
                for ( int j = 0; j <= lc; j++ ) {
                    resCM.addElement(i, j, getValueAt(i,j) + cm.getValueAt(i, j));
                }
            }
            resCM.setLastCol(lc);
            resCM.setLastRow(lr);
            resCM.setColDimension(this.colDim);
            resCM.setRowDimension(this.rowDim);
            return resCM;
        } else {
            System.out.println("Dimension mis-match for addition");
            return null;
        }
    }

    public CCompressedMatrix minus(CCompressedMatrix cm) {
        // minus is the same as plus with the matrix multiplied by -1
        return plus(cm.ScalarMultiply(-1));
    }



    // converts a Matrix to the compressed matrix data structure
    // This is not necessary for this project.
    // Written only for convenience for experiments
    public void compressFrom(Matrix mt) {
        rowList.clear();
        colList.clear();
        valueList.clear();
        for ( int i=0; i < mt.getRowDimension(); i++ ) {
            for ( int j=0; j < mt.getColumnDimension(); j++ ) {
                addElement(i,j,mt.get(i,j));
            }
        }
        this.rowDim = mt.getRowDimension();
        this.colDim = mt.getColumnDimension();
    }

    // Some vectors such as the page vector will be represented as a compressed matrix for convenient calculation.
    // Sometimes it is necessary to set the row dimension of the matrix so as not to lose dimension information
    // in case the last rows have values of 0.
    // This is due to the characteristics of the compressed representation of a sparse matrix.
    // This is not necessary for our given data set.
    public void toVerticalVectorInCompressedMatrixFormat(int i) {
        rowDim = i;
        colDim = 1;
        lastCol = 0;
    }



    // sets the value of the cell indexed by (i,j) in the regular matrix to the value k
    public void setValueAt(int i, int j, double k) {
        int begin;
        int end;
        boolean valFound = false;
        // first, check if cell (i,j) has a value which is not 0; if so modify it, otherwise do nothing
        // This can be done more simply as follows:
        //   if (getValueAt(i,j) != 0)
        //       modify the corresponding valueList entry
        // This was written for more clarity
        if ( i < rowList.size() - 1 ) {
            begin = rowList.get(i).intValue();
            end = rowList.get(i+1).intValue();
            if ( begin != -1 && end != -1 ) {
                for ( int l = begin; l < end && !valFound; l++ ) {
                    if ( colList.get(l).intValue() == j ) {
                        valueList.set(l, new Double(k));
                        valFound = true;
                    }
                }
                if ( !valFound ) {
                    System.out.println("There is no value for the cell, try addElement");
                }
            } else if ( begin == -1 ) {
                System.out.println("There is no value for the cell, try addElement");
            } else if ( end == -1 ) {
                int idx = i+2;
                while ( end == -1 && idx < rowList.size() ) {
                    end = rowList.get(idx).intValue();
                    idx++;
                }
                if ( end == -1 ) {
                    for ( int l = begin; l < colList.size() && !valFound; l++ ) {
                        if ( colList.get(l).intValue() == j ) {
                            valueList.set(l, new Double(k));
                            valFound = true;
                        }
                    }
                    if ( !valFound ) {
                        System.out.println("There is no value for the cell, try addElement");
                    }
                } else {
                    for ( int l = begin; l < end && !valFound; l++ ) {
                        if ( colList.get(l).intValue() == j ) {
                            valueList.set(l, new Double(k));
                            valFound = true;
                        }
                    }
                    if ( !valFound ) {
                        System.out.println("There is no value for the cell, try addElement");
                    }
                }
            }
        } else if ( i == rowList.size() - 1 ) {
            begin = rowList.get(i).intValue();
            end = colList.size();
            if ( begin == -1 ) {
                System.out.println("There is no value for the cell, try addElement");
            } else {
                for ( int l = begin; l < end && !valFound; l++ ) {
                    if ( colList.get(l).intValue() == j ) {
                        valueList.set(l, new Double(k));
                        valFound = true;
                    }
                }
                if ( !valFound ) {
                    System.out.println("There is no value for the cell, try addElement");
                }
            }
        } else {
            System.out.println("There is no value for the cell, try addElement");
        }
    }



    // get the value of the cell indexed by (i,j) in the regular matrix
    public double getValueAt(int i, int j) {
        int begin;
        int end;
        // find the colList range for the given row i first, and find j in colList within the range
        // if i and j can be located in the data structure, there is a non 0 value for the cell (i,j)
        // otherwise return 0
        if ( i < rowList.size() - 1 ) {
            begin = rowList.get(i).intValue();
            end = rowList.get(i+1).intValue();
            if ( begin != -1 && end != -1 ) {
                for ( int k = begin; k < end; k++ ) {
                    if ( colList.get(k).intValue() == j ) {
                        return valueList.get(k).doubleValue();
                    }
                }
                return 0;
            } else if ( begin == -1 ) {
                return 0;
            } else if ( end == -1 ) {
                int idx = i+2;
                while ( end == -1 && idx < rowList.size() ) {
                    end = rowList.get(idx).intValue();
                    idx++;
                }
                if ( end == -1 ) {
                    for ( int k = begin; k < colList.size(); k++ ) {
                        if ( colList.get(k).intValue() == j ) {
                            return valueList.get(k).doubleValue();
                        }
                    }
                    return 0;
                } else {
                    for ( int k = begin; k < end; k++ ) {
                        if ( colList.get(k).intValue() == j ) {
                            return valueList.get(k).doubleValue();
                        }
                    }
                    return 0;
                }
            }
        } else if ( i == rowList.size() - 1 ) {
            begin = rowList.get(i).intValue();
            end = colList.size();
            if ( begin == -1 ) {
                return 0;
            } else {
                for ( int k = begin; k < end; k++ ) {
                    if ( colList.get(k).intValue() == j ) {
                        return valueList.get(k).doubleValue();
                    }
                }
                return 0;
            }
        } else {
            return 0;
        }
        return 0;
    }



    // for convenience, get a value from a given compressed matrix other than 'this' matrix
    public double getValueAt(int i, int j, CCompressedMatrix cm) {
        int begin;
        int end;
        if ( i < cm.getRowList().size() - 1 ) {
            begin = cm.getRowList().get(i).intValue();
            end = cm.getRowList().get(i+1).intValue();
            if ( begin != -1 && end != -1 ) {
                for ( int k = begin; k < end; k++ ) {
                    if ( cm.getColList().get(k).intValue() == j ) {
                        return cm.getValueList().get(k).doubleValue();
                    }
                }
                return 0;
            } else if ( begin == -1 ) {
                return 0;
            } else if ( end == -1 ) {
                int idx = i+2;
                while ( end == -1 && idx < cm.getRowList().size() ) {
                    end = cm.getRowList().get(idx).intValue();
                    idx++;
                }
                if ( end == -1 ) {
                    for ( int k = begin; k < cm.getColList().size(); k++ ) {
                        if ( cm.getColList().get(k).intValue() == j ) {
                            return cm.getValueList().get(k).doubleValue();
                        }
                    }
                    return 0;
                } else {
                    for ( int k = begin; k < end; k++ ) {
                        if ( cm.getColList().get(k).intValue() == j ) {
                            return cm.getValueList().get(k).doubleValue();
                        }
                    }
                    return 0;
                }
            }
        } else if ( i == cm.getRowList().size() - 1 ) {
            begin = cm.getRowList().get(i).intValue();
            end = cm.getColList().size();
            if ( begin == -1 ) {
                return 0;
            } else {
                for ( int k = begin; k < end; k++ ) {
                    if ( cm.getColList().get(k).intValue() == j ) {
                        return cm.getValueList().get(k).doubleValue();
                    }
                }
                return 0;
            }
        } else {
            return 0;
        }
        return 0;
    }



    public CCompressedMatrix cutRow(int i, int j) {
        CCompressedMatrix resCM = new CCompressedMatrix();
        for ( int k = i; k < j; k++ ) {
            for ( int l = 0; l < colDim; l++ ) {
                resCM.addElement(k-i, l, getValueAt(k, l));
            }
        }
        resCM.setRowDimension(j-i);
        resCM.setColDimension(colDim);
        return resCM;
    }
}
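As an aside, the row-compressed lookup that getValueAt implements can be illustrated with a small standalone sketch. It uses plain arrays instead of linked lists; the class name CsrSketch and the example matrix are made up for illustration, and -1 marks an empty row exactly as in rowList above:

```java
// Minimal standalone sketch of the row-compressed lookup used by CCompressedMatrix.
// rowPtr[i] is the index into cols/vals where row i begins; -1 marks an all-zero row.
public class CsrSketch {
    static double valueAt(int[] rowPtr, int[] cols, double[] vals, int i, int j) {
        int begin = rowPtr[i];
        if (begin == -1) return 0.0;              // row i holds only zeros
        // scan forward past empty rows to find where row i's entries end
        int end = cols.length;
        for (int n = i + 1; n < rowPtr.length; n++) {
            if (rowPtr[n] != -1) { end = rowPtr[n]; break; }
        }
        for (int k = begin; k < end; k++) {
            if (cols[k] == j) return vals[k];     // found column j within row i's range
        }
        return 0.0;                               // no stored entry means the cell is 0
    }

    public static void main(String[] args) {
        // 3x3 matrix [[0,2,0],[0,0,0],[5,0,7]] -- row 1 is empty, so rowPtr[1] = -1
        int[] rowPtr  = {0, -1, 1};
        int[] cols    = {1, 0, 2};
        double[] vals = {2.0, 5.0, 7.0};
        System.out.println(valueAt(rowPtr, cols, vals, 0, 1)); // 2.0
        System.out.println(valueAt(rowPtr, cols, vals, 1, 1)); // 0.0
        System.out.println(valueAt(rowPtr, cols, vals, 2, 0)); // 5.0
    }
}
```

Note that only the non-zero entries are stored: the 3x3 matrix needs three values rather than nine, which is the point of the representation for the very sparse link matrix.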

C.2 CMainProcess.java

/**
 * CMainProcess.java
 * Created in 2007. 01. 30
 * @author Chang Min Kim
 * netID : ck273
 * email : ck273@cs.cornell.edu
 * Master of Engineering
 * Computer Science at Cornell University
 */

package pr;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedList;
import java.util.StringTokenizer;
import java.io.BufferedWriter;
import java.io.FileWriter;
import Jama.Matrix;


public class CMainProcess implements HTMLHandler {

    final double DAMP = 0.80;
    LinkedList<String> urlList;
    Matrix linkMatrix;
    Matrix rank;
    String perDoc;
    boolean title = false;
    boolean link = false;
    boolean mail = false;
    boolean img = false;
    String linkRec = "";
    String fileName;
    CCompressedMatrix compMatrix;
    int siteSize;

    public CMainProcess(String fn) {
        siteSize = 130;
        compMatrix = new CCompressedMatrix();
        //fileName = fn;
        fileName = "test4.txt";
        urlList = new LinkedList<String>();
        perDoc = "";
        linkRec = "";
        importURL();
        System.out.println("Canonicalizing Sites");
        System.out.println("This may take a while depending on the network connection\n");
        canonicalize();

        //load test data
        //importTest();

        CCompressedMatrix cmTest = new CCompressedMatrix();
        cmTest.compressFrom(linkMatrix);

        if ( compMatrix.isEqualTo(cmTest) ) {
            System.out.println("test matrix and compressed matrix are identical\n");
        }
        else {
            System.out.println("test matrix and compressed matrix are *** not *** identical\n");
        }

        if ( compMatrix.compareWithMatrix(linkMatrix) ) {
            System.out.println("Link matrix and compressed matrix are identical\n");
        }
        else {
            System.out.println("Link matrix and compressed matrix are *** not *** identical\n");
        }

        //outputLinkMatrix();
        //outputPIDList();
        calc_pageRank();
        calc_pageRank_compMatrix();
    }



    public void importTest() {
        linkMatrix = new Matrix(5,5);
        rank = new Matrix(5,1);
        try {
            BufferedReader in = new BufferedReader(new FileReader("list.txt"));
            String inStr;
            while ( (inStr = in.readLine()) != null ) {
                StringTokenizer st = new StringTokenizer(inStr);
                int from = Integer.parseInt(st.nextToken());
                int to = Integer.parseInt(st.nextToken());
                linkMatrix.set(from, to, 1.0d);
                compMatrix.addElement(from, to, 1.0d);
            }
            compMatrix.setDims(5, 5);
        }
        catch (IOException e) {
        }
    }

    public void importURL() {
        // import test4.txt
        try {
            BufferedReader in = new BufferedReader(new FileReader(fileName));
            String inStr;
            while ( (inStr = in.readLine()) != null ) {
                if ( inStr.endsWith("/") ) {
                    inStr += "index.html";
                }
                if ( !urlList.contains(inStr) ) {
                    urlList.add(new String(inStr.trim()));
                }
            }
        }
        catch (IOException e) {
        }
        linkMatrix = new Matrix(urlList.size(), urlList.size());
        rank = new Matrix(urlList.size(), 1);
    }





    public void canonicalize() {
        for ( int i = 0; i < urlList.size(); i++ ) {
            perDoc = "";
            title = false;
            link = false;
            img = false;
            String content = "";
            URL url = null;
            String tmp = urlList.get(i);
            tmp = tmp.trim();
            if ( tmp.endsWith("/") ) {
                System.out.println("bad form of url");
            }
            try {
                url = new URL(tmp);
            }
            catch ( MalformedURLException m ) {
                System.out.println("Illegal Format of URL");
            }

            perDoc += url.toString() + "\n";

            try {
                // Read all the text returned by the server
                InputStreamReader sr = new InputStreamReader(url.openStream());
                BufferedReader in = new BufferedReader(sr);

                String str;
                String contentToBeRevised = "";
                while ( (str = in.readLine()) != null ) {
                    contentToBeRevised += str.trim() + " ";
                }

                int count = 0;
                while ( count < contentToBeRevised.length() - 5 ) {
                    if ( contentToBeRevised.charAt(count) == '=' ) {
                        if ( contentToBeRevised.charAt(count-1) == ' ' ) {
                            contentToBeRevised = contentToBeRevised.substring(0, count-1) +
                                contentToBeRevised.substring(count);
                            count--;
                        }
                        if ( contentToBeRevised.charAt(count+1) == ' ' ) {
                            contentToBeRevised = contentToBeRevised.substring(0, count+1) +
                                contentToBeRevised.substring(count+2);
                        }
                    }
                    count++;
                }
                content = contentToBeRevised;

                sr.close();
                in.close();
            } catch (MalformedURLException e) {
            } catch (IOException e) {
            }

//            String cacheSite = "cache_" + i + ".txt";
//            try {
//                BufferedWriter caWriter = new BufferedWriter(new FileWriter(cacheSite));
//                caWriter.write(content);
//                caWriter.close();
//            } catch (IOException e1) {
//                e1.printStackTrace();
//            }

            try {
                //System.out.println(i+"th site");
                if ( (i % 25 == 0) && (i != 0) ) {
                    System.out.println(( (int)(((float)i/urlList.size())*100)) + "% done");
                }
                HTMLParserFactory parserFactory = HTMLParserFactory.getInstance();
                HTMLParser saxParser = parserFactory.getNewSAXHtmlParser();
                saxParser.parse(content, this);
            }
            catch (Exception e) {
                e.printStackTrace();
            }

            // analyze the perDoc string to get info
            BufferedReader br = new BufferedReader(new StringReader(perDoc));
            String lineString = "";

            try {
                while ( (lineString = br.readLine()) != null ) {
                    if ( lineString.trim().equalsIgnoreCase("<title>") ) {
                        title = true;
                        link = false;
                        img = false;
                    }
                    else if ( lineString.trim().equalsIgnoreCase("</title>") ) {
                        title = false;
                        link = false;
                        img = false;
                    }
                    else if ( lineString.trim().equalsIgnoreCase("<a>") ) {
                        link = true;
                        img = false;
                        title = false;
                    }
                    else if ( lineString.trim().equalsIgnoreCase("</a>") ) {
                        link = false;
                        title = false;