
Using Clustering to Develop a College Football Ranking System

Fall 2005
ECE 539
Final Project

Joseph Detmer




Abstract


This project undertook the task of creating a college football ranking system
based purely on statistics. The data was obtained from two impartial websites.
The algorithm (explained in greater detail below) first clusters the data and then
feeds the clustering into an equation from which the final rank results. A
reasonable system was created: rankings produced from our test data were very
close to those of published polls.


Introduction


Being a former soccer player, I was long wary of embracing the game of
football. The game was back and forth, long, and there was a lot of time
between the actual "action." High school football is not all that entertaining, and I
was never really able to embrace a professional or collegiate team, feeling too
distant from any particular one. Then I came to college. Games were much
more exciting: the number and enthusiasm of the fans, the size of the athletes,
and the skill of the players were all much improved. Not many other things can
make thousands of people wake up early on a cool Saturday morning in the fall
to go sit in a parking lot grilling brats for several hours. While professional
football is extremely exciting, college football is not far off, and is held in higher
regard by some. Whether it be an unranked team upsetting a highly ranked
opponent, a national title game, or an alma mater playing its always-hated
archrival, college football entertains millions.


At the end of each season, any college football team with at least 6 wins is
eligible for a bowl. Since 1998, a computer system has been in place to
determine the top college football teams in the country. The initial purpose of the
system was to match "equal" teams up to play in exciting contests for more than
just the national championship game. This system brought the top four bowl
games together, letting each take a turn hosting the national title game. These bowls


initially had been reserved for the champions of the best conferences. This new
system allowed a good team from a somewhat weaker conference to play in a
big game at the end of the season.



Any system created to predict how good a team is or who will win a game will
eventually fail. There are too many factors that cannot be taken into account,
such as star players being injured, an amazing strategy developed by a coaching
staff, or a team simply having an off day. This process will therefore determine
which team is, on average, the best.


Motivation


The most difficult question is: how do you quantify how good a team is? Is it
their record? Is it how many points they score, total yards of offense they gain,
or turnovers they create? If one were to look up statistics for college football
teams, one could find any kind of stat one would ever want. For example, one
could find the rankings of all teams' average yardage on 1st down in the 3rd
quarter, if one wanted. However, such information is far too specific for the
question of how good a team is. Most experts would agree that many factors
can be combined into a quantitative representation of how good a team is. The
BCS system uses several computer models combined with polls to determine its
rankings. I will create a system using only a computer model.


In this project, it was determined that how good a team is would be measured
by a small subset of general data. These data points will not directly correspond
to inputs to the system. Each data set will first be clustered into several clusters.
Data from one year will then be used to create a function from the clustered
data. This function will be applied to a second year's (2004) data, which will
result in a test set.





Data Collection


The first and most difficult part of the project was getting the necessary data.
The first step was to decide which statistics are most influential in how good a
team is. As noted above, there is a huge number of statistics that could be
retrieved. Which ones should be chosen? We want a large enough set of data
to get good results, but not so large that the statistics become redundant and
the data difficult to obtain. The problem was initially broken up into pieces:

- Offense
- Defense
- Special Teams & Turnovers
- Record & Strength of Schedule

These four pieces can be handled more easily than the whole together.


Offense:

A good offense is integral to how good a team is. If a team is never able to
score points, it will never win a game. It is not unusual for a defense to score
points, but to count on it for a team's entire point production would be suicide.
So how do we decide how good an offense is? Rushing yards have proven to be
a very important part of a football team. If a team cannot run the ball
consistently, it is difficult to tire a defense. Passing yardage is also extremely
important. Passing can produce quick points, or keep the ball in your possession
late in the game. However, a balanced attack is truly the key to a good offense.
While teams that have either a solid running game or a solid passing game can
sometimes be effective, a team with both is much better. Having a good rushing
game and a good passing game keeps the defense guessing what will come
next. Finally, we must remember that yards mean nothing if a team does not
score, so offensive scoring is also integral. This leaves us with 4 data sets for
offense:

- Rushing yardage
- Passing yardage
- Total Offensive yardage
- Offensive Scoring


Defense:

The saying is "Offense wins games, Defense wins championships." But how do
we determine how good a defense is? Most would say the exact opposite of
determining how good an offense is. Therefore, for reasons like those in the
offense section, our data sets for defense are:

- Rushing yardage allowed
- Passing yardage allowed
- Total yardage allowed
- Total Defensive points allowed


Special Teams & Turnovers:

While offense and defense are a huge part of football, games often come down
to special teams. If neither team can produce any yards on offense, the team
with better field position is more likely to win. I was hoping to find a statistic
for average starting field position; however, after a tedious search provided
nothing of the sort, I opted for statistics that could form something like it.
Finally, we can't forget turnovers. Often, a single turnover changes the course of
a game. There are many different types (offensive and defensive fumbles and
interceptions), but do we really need to take all of them into account? If a team
creates many turnovers but gives up just as many, the team does not really gain
anything. Therefore turnover margin is included in our data set. This brings the
data collected in this section to the following sets:

- Turnover margin
- Net punting yardage
- Punt return yardage
- Kick return yardage
- Kick return defensive yardage allowed


Record & Strength of Schedule:

The single most important thing taken into account when determining how good
a football team is, is its record. The whole purpose of a football game is to see
who will win. If a team loses, it is very difficult to argue that that team is better.
Statistics for how many yards a team gains, or how many turnovers it creates,
are an important part of determining how good a team is; however, the most
important part should be the team's record. A good team will rarely be beaten.
A poor team will rarely win. There is one more important statistic that should be
taken into account, though. If one team has few losses but has played no "good"
teams, while another team with a couple of losses has only played "good" teams,
who is better if the two have not played each other? Strength of schedule is
calculated as 2 parts the record of your opponents and 1 part the record of your
opponents' opponents, which yields a decimal between 0 and 1. Our last two
data sets are:

- Record
- Strength of schedule
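The 2-to-1 weighting just described can be written out directly. A minimal sketch (in Python rather than the project's Perl/Matlab; the win percentages below are made-up example values):

```python
def strength_of_schedule(opp_win_pct, opp_opp_win_pct):
    """Strength of schedule: 2 parts opponents' record,
    1 part opponents' opponents' record (a value in [0, 1])."""
    return (2 * opp_win_pct + opp_opp_win_pct) / 3

# Hypothetical example: opponents won 60% of their games,
# and opponents' opponents won 45% of theirs.
sos = strength_of_schedule(0.60, 0.45)
print(round(sos, 4))  # 0.55
```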




As stated above, there are many websites that carry statistics on college football.
Some have very little data; some have quite a few statistics, but few useful ones.
In the end I chose two websites that contained all the data I was looking for.
Since their pages differ, the two sites each required a unique way of parsing the
.htm files for the necessary data. One reason I chose these two sites was that
they keep archived data, so I was able to obtain data from last year as well as
this year. All data except strength of schedule was gathered from the first site.
The data was gathered from the following two sites:

http://web1.ncaa.org/d1mfb/natlRank.jsp?div=4&site=org




http://www.warrennolan.com/football/2005/sos


One final site was used to classify the training data. I was only able to find one
source that ranked all 119 teams; most polls and rankings cover only the top 25
teams. This site did not archive data, so I was only able to obtain this ranking
for the current year.

http://cbs.sportsline.com/collegefootball/polls/119/index1


I developed two Perl scripts to do the parsing of the data. They are essentially
the same, but slightly modified for different instances. Data for 2005 was used
as the training set, and data for 2004 was used as the testing set. It should be
noted that these scripts were developed after downloading the files on Sun
Solaris machines. Problems occurred when downloading with Internet Explorer
on Windows; it seems the webpage is saved in a different format. Therefore, if
you wish to recreate this project, proceed with caution. It should also be noted
that all directories must be created before any commands are executed: the
commands do not check whether directories or files exist before using them.
The user only needs to make sure all directories are created and that the initial
.htm files exist and are named correctly. All Matlab execution was done on a
Windows platform.

The two Perl files are named extract_stuff.pl and 2004extract_stuff.pl. These
two files output to files that contain the team name as well as the specific data
set.
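The idea behind the extraction scripts can be illustrated in a few lines. This is a Python sketch on an invented HTML fragment, not the actual Perl code or the real page layout (each real site needed its own parsing logic):

```python
import re

# Made-up fragment standing in for one of the downloaded .htm stat
# pages; column order and markup are assumptions for illustration.
html = """
<tr><td>1</td><td>Texas</td><td>512.3</td></tr>
<tr><td>2</td><td>Southern California</td><td>490.1</td></tr>
"""

# Pull out (team name, statistic) pairs from each table row.
rows = re.findall(
    r"<tr><td>\d+</td><td>([^<]+)</td><td>([\d.]+)</td></tr>", html)

for team, value in rows:
    # One "team name<TAB>statistic" line per team, mirroring the
    # per-data-set output files the Perl scripts produce.
    print(f"{team}\t{value}")
```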


Data Manipulation


The data sets created by the two initial Perl scripts are the inputs to our
clustering method. This method takes each individual data set and clusters it,
using ten clusters. The number of clusters was chosen to be large enough that
the separation between large statistics and small statistics was noticeable, yet
still small enough that the clusters were of reasonable size. Each data set was
clustered, and each data point in that set was assigned to a particular cluster.
Each cluster was then given a weight (an integer from 1 to 10) according to the
rank of that cluster, with 1 being the lowest and 10 being the highest. Once this
occurred, the new weights were output to a file, where they would be
rearranged. The two files that take care of this clustering are cl.m and cl2004.m.
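The cluster-then-weight step can be sketched as follows. This is a Python stand-in for what cl.m presumably does (the original is Matlab, and the exact clustering routine it calls is not shown), using a simple 1-D Lloyd's k-means:

```python
import random

def kmeans_1d(values, k=10, iters=50, seed=0):
    """Simple 1-D k-means (Lloyd's algorithm). Returns the cluster
    index for each value and the final cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # initialize centers from the data
    for _ in range(iters):
        # Assign each point to its nearest center.
        assign = [min(range(k), key=lambda j: abs(v - centers[j]))
                  for v in values]
        # Move each center to the mean of its assigned points.
        for j in range(k):
            pts = [v for v, a in zip(values, assign) if a == j]
            if pts:
                centers[j] = sum(pts) / len(pts)
    return assign, centers

def cluster_weights(values, k=10):
    """Weight each value 1..k by the rank of its cluster's center
    (1 = lowest-valued cluster, k = highest)."""
    assign, centers = kmeans_1d(values, k)
    order = sorted(range(k), key=lambda j: centers[j])
    weight_of = {j: rank + 1 for rank, j in enumerate(order)}
    return [weight_of[a] for a in assign]

# Example: one statistic (e.g. rushing yardage) across several teams.
weights = cluster_weights([float(i) for i in range(20)])
```

A team whose statistic falls in the highest cluster receives weight 10, and one in the lowest cluster receives weight 1, matching the weighting scheme described above.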


An example clustering is shown below. The data set being clustered is defensive
pass yardage. The cluster shown in blue represents only the points that belong
to the 2nd cluster. The remaining cluster centers are shown, but their data
points have been removed. The Voronoi lines are drawn to show where one
cluster ends and the next begins.




The reason clustering was chosen as the engine of this operation was its
versatility. From year to year, statistics can vary greatly. In a particularly cold
and rainy season, little offensive production can lead to low statistics. A trained
MLP or fuzzy set would have a much more difficult time than a clustering
algorithm dealing with statistics that fluctuate from year to year. Rankings
should be created from how good a team is in a particular year, and not
compared to other years.


Once the clustering is complete and the output sent to files, two more Perl
scripts reform the data into a single array. This array contains each team's
cluster results on one line, with the corresponding data set in each column. The
2005 data is also given a rank, which is essentially all the training it receives;
this will be discussed later. The two files that reform the data are toFunc.pl and
2004toFunc.pl.


Finally, a single number was calculated for each team in the test set. In order to
calculate this, a function first needed to be created. From the training set (2005
data), we used a matrix algebra trick. We call the input matrix X. The rank of
each team is the vector Y. We call our coefficient matrix A. We know that

    A * X = Y
    A = Y * X^-1

However, since X is not square, we cannot take a true inverse of it. Therefore,
we take the pseudoinverse. Once we have our coefficient matrix, we apply the
coefficients to our test matrix, which gives us approximate rankings. The Matlab
file that does this is makefunction.m.
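The pseudoinverse step can be sketched with NumPy (makefunction.m itself is Matlab; the training data below is randomly generated for illustration, and the dimensions are assumptions):

```python
import numpy as np

# Made-up training data: each column of X is one team's cluster
# weights (plus record and strength of schedule); Y holds the known
# ranks, one entry per team.
rng = np.random.default_rng(0)
X = rng.integers(1, 11, size=(15, 40)).astype(float)  # 15 features, 40 teams
A_true = rng.normal(size=(1, 15))
Y = A_true @ X

# X is not square, so use the Moore-Penrose pseudoinverse:
#   A * X = Y  =>  A = Y * pinv(X)
A = Y @ np.linalg.pinv(X)

# Applying A to new (test) columns gives approximate ranking scores.
scores = A @ X
print(np.allclose(scores, Y))  # True for this consistent system
```

This is the least-squares solution: for real data, A * X only approximates Y, and the residual is as small as possible in the least-squares sense.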


The data is then sorted with a final Perl script (getnewranks.pl) and output to its
final file (FINALRANKINGS.TXT).


Results


#1. Southern California - 167.3486
#2. Auburn - 161.4341
#3. Oklahoma - 116.2092
#4. Texas - 112.4908
#5. Miami (Fla.) - 112.4448
#6. Virginia Tech - 111.4653
#7. California - 108.7134
#8. Florida St. - 108.4097
#9. Utah - 107.8165
#10. Louisville - 107.5966
#11. Iowa - 107.5766
#12. Boise St. - 104.7642
#13. Georgia - 100.8888
#14. Bowling Green - 95.4476
#15. Purdue - 91.1031
#16. Virginia - 88.6308
#17. Arizona St. - 88.2669
#18. Texas A&M - 86.3354
#19. Wisconsin - 83.4357
#20. Navy - 83.3217
#21. Fla. Atlantic - 83.0395
#22. Ohio St. - 80.6131
#23. Tennessee - 79.4607
#24. UTEP - 79.2717
#25. Texas Tech - 79.0969
#26. Troy - 77.8624
#27. Pittsburgh - 77.6789
#28. Michigan - 76.7551
#29. Memphis - 75.9943
#30. Georgia Tech - 75.2869
#31. LSU - 74.5189
#32. Miami (Ohio) - 74.3935
#33. Fresno St. - 73.4196
#34. Oregon St. - 73.057
#35. Connecticut - 72.5143
#36. Northern Ill. - 72.2317
#37. Penn St. - 71.3406
#38. Florida - 69.7872
#39. West Virginia - 69.6947
#40. Oklahoma St. - 68.7826
#41. Notre Dame - 68.6967
#42. Minnesota - 67.5152
#43. Hawaii - 66.6198
#44. Arkansas - 64.826
#45. Wyoming - 63.4262
#46. New Mexico - 62.9263
#47. Colorado - 62.323
#48. Toledo - 62.2786
#49. Boston College - 62.1604
#50. North Carolina - 61.1465
#51. Cincinnati - 59.039
#52. South Carolina - 58.1333
#53. Marshall - 56.4976
#54. Arizona - 55.924
#55. Clemson - 55.6617
#56. Brigham Young - 55.4913
#57. TCU - 55.0452
#58. Kansas - 54.9249
#59. UAB - 54.3204
#60. Maryland - 54.1745
#61. Iowa St. - 52.9625
#62. Nebraska - 51.3158
#63. Washington St. - 50.6705
#64. Stanford - 50.27
#65. Oregon - 49.8981
#66. Alabama - 49.7194
#67. Akron - 49.5765
#68. North Carolina St. - 48.8846
#69. Tulane - 48.5183
#70. UCLA - 48.1761
#71. Missouri - 47.5022
#72. Kent St. - 47.1419
#73. Southern Miss. - 46.0345
#74. Northwestern - 45.1281
#75. Middle Tenn. St. - 44.2224
#76. North Texas - 44.1114
#77. Michigan St. - 43.7986
#78. San Diego St. - 41.1538
#79. Syracuse - 40.7939
#80. Louisiana Tech - 40.6867
#81. Nevada - 38.9364
#82. Wake Forest - 38.7285
#83. New Mexico St. - 38.2424
#84. La. Monroe - 36.0539
#85. Rutgers - 34.4375
#86. Kansas St. - 32.1734
#87. Colorado St. - 31.5946
#88. Vanderbilt - 31.0494
#89. Air Force - 30.8899
#90. Central Mich. - 30.5194
#91. Eastern Mich. - 30.2132
#92. Baylor - 29.4173
#93. Tulsa - 28.4209
#94. Ohio - 28.2544
#95. South Fla. - 27.7194
#96. Mississippi - 27.5643
#97. Western Mich. - 26.2056
#98. La. Lafayette - 25.4322
#99. Southern Methodist - 24.6299
#100. Mississippi St. - 23.9756
#101. Houston - 23.5796
#102. Kentucky - 22.8338
#103. Utah St. - 22.0091
#104. Temple - 21.5889
#105. East Caro. - 18.2943
#106. Indiana - 18.2592
#107. Illinois - 18.1962
#108. Ball St. - 16.9403
#109. UNLV - 16.4274
#110. Washington - 15.3223
#111. UCF - 15.3098
#112. Buffalo - 14.6902
#113. San Jose St. - 14.4281
#114. Idaho - 13.9488
#115. Arkansas St. - 12.7652
#116. Duke - 12.7557
#117. Army - 8.8534
#118. Rice - 5.8056



Discussion


The rankings above are exceptionally close to the rankings from last year. A
comparison poll can be seen at
http://sports.espn.go.com/ncf/rankings?pollId=2&seasonYear=2004&weekNumber=17

Looking at the top 25 teams from last year, the first four are identical. Also, of
the teams in last year's top-25 polls, this algorithm placed 20 in its own top 25.
This may not seem good, but teams in the lower half of those rankings often
drop out from week to week, while teams just outside the top 25 come in. It is
difficult to say that I am wrong or that the polls are wrong; they are simply two
different opinions, two different points of view on the same problem.


On the whole, this approach seems to work. A better approach would be to
compile several years' worth of data and create a function from that; I feel this
function would give a better representation of rank, since my training set is the
same size as my testing set. There are also several notes I would like to make
on this ranking. First, this algorithm cannot be truly objective until the input
rank is determined objectively. The rank used here was obtained through a
combination of computer models and human input, so in some way this
algorithm does not truly determine how good a team is; it is slightly subjective
to its prior input. However, I do not see how a computer could start making
decisions on how good a team is without first getting input from a human on
what is good or not, whether through training or a preset function. The human
programming the data will be the one who, at least primarily, determines who
the "best" team is.


References


http://www.bcsfootball.org/


http://sports.espn.go.com/sports/tvlistings/abcStory?page=aboutbcs



Miscellaneous


Input File Names (relative to the directory containing the scripts; ? is either 5
or 4, depending on the year)

200?data/kickret.htm       kick return yardage
200?data/passoff.htm       pass offense yardage
200?data/scoredef.htm      scoring defense
200?data/totaloff.htm      total offensive yardage
200?data/kickretdef.htm    kick return defense yardage
200?data/puntret.htm       punt return yardage
200?data/scoreoff.htm      scoring offense
200?data/turnover.htm      turnover margin
200?data/netpunt.htm       net punt yardage
200?data/rushdef.htm       rush defense yardage
200?data/passdef.htm       pass defense yardage
200?data/rushoff.htm       rush offense yardage
200?data/totaldef.htm      total defensive yardage
200?data/sos.html          strength of schedule


Directories (if using my scripts, you must create these directories before
execution)

doneclust
doneclust2004
parseddata
parseddata2004
prefunc


Output File

FINALRANKINGS.TXT


Source Code

Only one copy of each pair of nearly identical scripts is attached in hardcopy;
all scripts are included in softcopy.
All are attached.