Question 3: Data mining Helsinki Transportation

fantasicgilamonster

Nov 20, 2013


Question 3




Data Mining Home Exam

What information should YTV collect

Assuming there is no limitation on what information can be collected, I
would urge YTV to collect as much information as possible.

Storage nowadays is relatively inexpensive, and if they decide that
collecting a particular attribute is of no use, then it can be discarded or
pruned later.

With some thought, a list of what could and should be collected has been
compiled with a brief explanation why:

The route number: trains (metro and regional), buses, and trams have
route numbers attached to their journey route. They can be quite
informative, like the 7A and 7B; however, the 3T or 3B is quite confusing,
to me at least, as it seems to run in several directions. With this data
you could find out whether people are likely to be taking different routes
to their destination.

Transport type: the route number may be satisfactory on its own; however,
to cover all options, another field could be created to represent what form
of transport you are using, in case of a conflict between route numbers.

The direction: because a route such as the 3T or 3B seems to run in several
directions, I suggest recording the direction as well; for me at least, it
gives a better indication of the vehicle’s journey route.

The location: the corresponding location, stored in some logical format,
as given by the GPS locator on board.

If it's stationary or moving: some people stamp their ticket as soon as
they board the transport vehicle, and some people do it later. It would
be interesting to see when people use the machine. The time could be
compared against the external timestamps of the previous and next stops.
A GPS device may not be able to record whether it’s moving or stationary.

The timestamp of the transaction: to find out exactly at what time the
transaction took place, for record keeping. This should be not only the
time, in seconds at the least, but the full date too. The date is of
course useful for knowing at what times of the year the smart card is being
used; however, keeping time to a finer degree, such as fractions of a
second, could give a better insight into the operation of the card reader
with its ...

Transaction ID: a unique ID for every transaction.

Card type: there are many different types of card, and by recording
card information, problems such as error-prone batches of cards can be
quickly identified.

Invalid e...: if there is any reason to suspect an out-of-the-ordinary
transaction, this field should be set and recorded along with all the
other error information, including times and the menus on which the error
occurred.

Credit status: how much credit is left on the smart card, with the
appropriate units.

Transaction type: which transaction type is favoured (time or money).

Smart card identification: a unique value to identify the card’s owner.

Ticket machine ID on the transport vehicle: which card reader was used,
in order to find out which ticket machines are not being used. This
information could improve customer service by placing card readers in
better locations or making sure they are operable.

Transaction time: it is interesting to find out how long an actual
transaction takes. This could be done with another timestamp when the
smart card is removed. This information could be used to identify
machines that may need bigger displays, better buttons or better
instructions in order to improve efficiency of use, with the goal of
excellent customer service.
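
The attributes listed above could be gathered into one record per stamping.
What follows is only a sketch of how I imagine such a record: the class
name, the field names, the types and the "EUR" unit are my own assumptions
for illustration and are not an actual YTV schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TransactionRecord:
    """One stamping of a smart card (all field names are hypothetical)."""
    transaction_id: int
    card_id: str                 # identifies the card's owner
    card_type: str
    route_number: str            # e.g. "7A"
    transport_type: str          # e.g. "tram", "bus", "metro"
    direction: str
    latitude: float              # from the GPS locator on board
    longitude: float
    moving: Optional[bool]       # None when the device cannot tell
    inserted_at: datetime        # card presented to the reader
    removed_at: datetime         # card removed from the reader
    reader_id: str               # which ticket machine was used
    transaction_type: str        # "time" or "money"
    credit_left: float
    credit_unit: str = "EUR"     # assumed unit
    invalid: bool = False
    error_info: Optional[str] = None

    def duration_seconds(self) -> float:
        """Length of the transaction, from insertion to removal."""
        return (self.removed_at - self.inserted_at).total_seconds()


# Made-up example values, only to show the record in use.
example = TransactionRecord(
    transaction_id=1, card_id="C-001", card_type="adult",
    route_number="7A", transport_type="tram", direction="north",
    latitude=60.17, longitude=24.94, moving=False,
    inserted_at=datetime(2004, 1, 1, 8, 0, 0),
    removed_at=datetime(2004, 1, 1, 8, 0, 2, 500000),
    reader_id="R-12", transaction_type="money", credit_left=12.5,
)
print(example.duration_seconds())
```

Keeping both timestamps in the record means the transaction time never has
to be stored separately; it can always be derived from them.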

Possibilities of External data and their restrictions

Bringing in external data from trusted sources could bring further interesting
results to the data mining process. We could learn much more about the
customers of public transport and, more importantly, how to better serve
their needs.

External data is an ambiguous term for me: does it mean data external to
the stamping of the new smart card, or does it mean third-party data?

Either way, we shall look at both.

Further external data that can be collected on board the transport vehicle,
hence supplied by us:


I have noticed that some vehicles, such as trams, have a beam of light
crossing your path as you board. Hence with this device you should be able
to work out how many people are aboard the vehicle at any one time, and
find out how full it is. This statistic could influence the buying
behaviour and hence use of the smart card. For example, on a busy tram a
customer may not pay for his or her journey, thinking it is unlikely that
a conductor will check.

Stop and timekeeping history: a bus, for example, may not stop at
every one of its designated stops because nobody is there, or because
nobody has triggered the stop alarm on the bus beforehand. This
information could be combined with the smart card information in order
to possibly find interesting patterns. For instance, if the bus stops
often and/or is not keeping to its schedule, the customer may have stopped
using the service and opted for his or her own vehicle.

Ticket checker/conductor on board: having somebody on board the
transport vehicle may have a dramatic influence on buying behaviour.
This could be a good indication in order to fight fraud by people riding
“black” (without a ticket). However, on the other hand, this constant
checking of valid tickets may deter people from using public transport.
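
The passenger count idea above can be sketched as follows. I am assuming
that the light beam gives a number of people boarding and alighting at each
stop; the function names and the sample numbers are made up for
illustration.

```python
from typing import List, Tuple


def occupancy_per_stop(counts: List[Tuple[int, int]]) -> List[int]:
    """Running number of passengers on board after each stop.

    Each item is (boarded, alighted) as counted by the door sensor.
    """
    on_board = 0
    result = []
    for boarded, alighted in counts:
        # Sensors can miscount, so never let the total go below zero.
        on_board = max(0, on_board + boarded - alighted)
        result.append(on_board)
    return result


def stamping_rate(boarded: int, stamps: int) -> float:
    """Share of boarding passengers who stamped a card at this stop."""
    return stamps / boarded if boarded else 0.0


counts = [(10, 0), (5, 3), (0, 8)]
print(occupancy_per_stop(counts))   # [10, 12, 4]
print(stamping_rate(10, 7))         # 0.7
```

A low stamping rate on a full vehicle would only be a hint, not proof, of
people riding without paying, as passengers may hold other valid tickets.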

Other third-party data could be used in analysis in order to better
understand our customers, such as their home and work addresses from the
authority that issued the smart card.

Technically, we could reasonably discover the details of a particular
customer’s probable daily journey.

We could also get details from other institutions where the particular
customer has used the same smart card, for example while paying for some
food, or even by using the smart card with a GSM mobile phone to pay for
calls. Smart card technology is making this sort of detailed information
possible; however, there are problems. Firstly, on a moral basis, we should
be free to do whatever we want without being followed or tracked. The
liberty to do what one wants and to remain anonymous are rights cherished
by democratic societies. Collecting this sort of information may not agree
with some people’s views. This information could well help to directly
improve our service to the customer; however, the customer may feel that
he or she has lost his or her privacy. This may anger customers and create
the opposite, undesirable situation for the transport company.

Legally, I am unfamiliar with the Finnish data protection laws; however,
they are quite strictly liberal. In the UK, photo ID is generally not used,
as many people don’t want a situation where their papers are checked at
every step, restricting their freedom. Hence the UK has the Data
Protection Act, which forces many companies to have a strict data
protection policy. If you believe an authority holds records on you, you
can legally ask them to show all the records they may have of you. When
submitting personal data, all forms must contain a check box asking
whether a customer wants his or her information to be available for
analysis.

A law like this British one, in the context of public transport data mining
in Helsinki, would not be a problem provided people have given their consent
and don’t ask for their information too often, as administering such a
large dataset is technically expensive.

Methods presented throughout the Data Mining course are applicable to this
case study. The resulting data set from the collected information will be
very large, and ever growing.

A data warehouse server would need to be established, with the card
readers acting as its clients. However, the data passing between these
entities will need to be filtered or cleaned in order to ensure consistent,
reliable data. Furthermore, the data may need to be partitioned and
converted using a Morphological Analysis program, all under the
pre-processing stage.
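
The filtering and cleaning step could look something like the sketch below.
It assumes that each record arrives from a card reader as a dictionary; the
field names in REQUIRED_FIELDS and the cleaning rules themselves are my own
assumptions, chosen only to illustrate the idea.

```python
from typing import Dict, Iterable, List

# Assumed field names; a real reader may send different ones.
REQUIRED_FIELDS = ("transaction_id", "card_id", "route_number", "timestamp")


def clean(records: Iterable[Dict]) -> List[Dict]:
    """Drop incomplete or repeated records and normalise route numbers."""
    seen = set()
    cleaned = []
    for record in records:
        if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
            continue  # incomplete record
        if record["transaction_id"] in seen:
            continue  # the same transaction was sent twice
        seen.add(record["transaction_id"])
        fixed = dict(record)  # copy, so the input is left untouched
        fixed["route_number"] = str(record["route_number"]).strip().upper()
        cleaned.append(fixed)
    return cleaned


raw = [
    {"transaction_id": 1, "card_id": "C-001", "route_number": " 7a ", "timestamp": 100},
    {"transaction_id": 1, "card_id": "C-001", "route_number": "7A", "timestamp": 100},
    {"transaction_id": 2, "card_id": "", "route_number": "3T", "timestamp": 160},
]
print(clean(raw))
```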

As we have a large data set, it is only logical that we use one of the main
tools for knowledge discovery in databases, namely association analysis.
Classification and prediction methods should also be strongly applicable,
as we want to enhance customer satisfaction. Accurately predicting and
forecasting consumer demands (and demands on the transport
infrastructure) using training data could meet this requirement.

With such a large dataset, we would use efficient algorithms to find
possibly interesting frequent itemsets using candidate generation, that is,
the Apriori algorithm. As some attributes are multi-dimensional, such as
location and possible card types in the future, we would naturally use
association rules to abstract information about the areas of ...
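
To make the candidate generation idea concrete, here is a small sketch of
Apriori over stampings, where each stamping is treated as a set of
"attribute=value" items. The encoding of the items and the sample data are
my own assumptions; a real run over the full dataset would need a far more
efficient implementation.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def apriori(transactions: List[Set[str]], min_support: int) -> Dict[FrozenSet[str], int]:
    """Return every itemset contained in at least min_support transactions."""

    def support(candidates: Iterable[FrozenSet[str]]) -> Dict[FrozenSet[str], int]:
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in counts:
                if c <= t:  # candidate is a subset of the transaction
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {frozenset([item]) for t in transactions for item in t}
    frequent = support(items)
    result = dict(frequent)
    k = 2
    while frequent:
        previous = set(frequent)
        # Join step: unite frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        frequent = support(candidates)
        result.update(frequent)
        k += 1
    return result


# Made-up stampings for illustration.
stamps = [
    {"route=7A", "type=tram", "card=adult"},
    {"route=7A", "type=tram", "card=student"},
    {"route=7A", "type=tram", "card=adult"},
    {"route=3T", "type=tram", "card=adult"},
]
frequent = apriori(stamps, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a minimum support of 3, the pair route=7A with card=adult (seen only
twice) is dropped, so no three-item candidate survives the prune step.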

As these transactions are closely tied to time, episode theory may come
into use in predictive analysis of the data. As the knowledge is quite
valuable to us, we would need to employ the proper, expensive tools to
process it, i.e. to analyse, query, report and present the knowledge for
the transport authority to use effectively. The whole data mining process
will be carefully administered and will employ all available data mining
methods to better understand the customer in order to extend and
enhance transportation services.
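
As a rough illustration of the episode idea, the sketch below counts how
often one event type is followed, in order, by the others within a time
window. This is a simplified count in the spirit of serial episodes, not
the full window-based algorithm from the course; the event names, the
window length and the sample data are my own assumptions.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (time in seconds, event type)


def count_serial_episode(events: List[Event], episode: List[str], window: float) -> int:
    """Count start events after which the rest of the episode occurs,
    in order, within `window` seconds of the start event."""
    events = sorted(events)
    count = 0
    for i, (start, kind) in enumerate(events):
        if kind != episode[0]:
            continue
        position = 1  # index of the next episode step to match
        for time, later_kind in events[i + 1:]:
            if time - start > window:
                break
            if position < len(episode) and later_kind == episode[position]:
                position += 1
        if position == len(episode):
            count += 1
    return count


# Made-up stampings of one card: a tram ride followed by a metro ride.
card_events = [(0, "tram 7A"), (600, "metro"), (3000, "tram 7A"), (9000, "metro")]
print(count_serial_episode(card_events, ["tram 7A", "metro"], window=1800))  # 1
```

Only the first tram ride is counted here, since the second metro stamping
falls outside the 30-minute window.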


Course material and exercise 5 from 581550 Data Mining

Authors: Jiawei Han & Micheline Kamber
Title: Data Mining: Concepts and Techniques