Talal Mufti
Ali JavadiAbhari
COS 424
Interacting with Data
Final Project Proposal
Problem
We are going to analyze GPS data and predict the snap weight (or map-matching value), an indicator of data point error, based on the independent variables latitude, longitude, velocity, and heading.
As GPS-enabled devices continue to proliferate, there seems to be a huge body of data on people's day-to-day locations and travel, which offers great potential for analysis. One important aspect of this data is how well the positions transmitted by the GPS device fit the actual map. In other words, certain factors affect the precision of the data with respect to the conventional trajectory path (e.g. a road), and knowing these factors is important in understanding how well the GPS device performs under certain conditions, potentially helping with its improvement. The other benefit is in drawing the route a GPS device has traversed: if the errors are not known, they cannot be corrected, and the drawn path can deviate significantly from the true path.
Classically, these snap weights are found by building an exact model of the map and subsequently calculating the distance between the actual and transmitted positions. This can be troublesome, since it needs both a precise model of road coordinates for a large area and additional computation time for each data point. Furthermore, it does not give us insight into what factors might have affected the errors. In this project, we aim to use machine learning to identify these factors and to predict the error rather than calculate it using a map.
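For reference, the map-based distance calculation described above can be sketched as follows. This is only an illustration under our own assumptions: the road is a single straight segment, the Earth is treated as a sphere, and a flat local projection is used. The function names are hypothetical, and the exact definition of snap weight used by the application is not known to us.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius (spherical approximation)


def to_local_xy(lat, lon, ref_lat, ref_lon):
    """Project (lat, lon) in degrees to metres on a plane tangent at the reference point."""
    x = math.radians(lon - ref_lon) * math.cos(math.radians(ref_lat)) * EARTH_RADIUS_M
    y = math.radians(lat - ref_lat) * EARTH_RADIUS_M
    return x, y


def distance_to_segment_m(point, seg_start, seg_end):
    """Distance in metres from a GPS fix to a straight road segment.

    All arguments are (lat, lon) pairs in degrees. The segment start is
    used as the origin of the local plane.
    """
    ref_lat, ref_lon = seg_start
    px, py = to_local_xy(point[0], point[1], ref_lat, ref_lon)
    dx, dy = to_local_xy(seg_end[0], seg_end[1], ref_lat, ref_lon)
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: fall back to the distance to its single point.
        return math.hypot(px, py)
    # Fraction along the segment of the closest point, clamped to [0, 1].
    t = max(0.0, min(1.0, (px * dx + py * dy) / seg_len_sq))
    return math.hypot(px - t * dx, py - t * dy)
```

A real map contains many segments per road and many roads per area, so this calculation would have to be repeated against every candidate segment for every sample, which is the cost we hope to avoid.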
Data
We have obtained data from a popular GPS application, CoPilot Live, which is available on many platforms. For a large number of users, we have their latitude, longitude, and velocity, among other peripheral data. The data are sampled at three-second intervals, as seen in figure 1.
Figure 1: Sample of our data
In this dataset, we also have access to calculated snap weights for each sample. The company has developed a representation of the obtained values, shown in figure 2.
Methods
To predict snap weight from our independent variables, we propose experimenting with several machine learning algorithms as well as different sets of assumptions. To begin, we take the case in which we assume our response variables to be independent and identically distributed. This assumption, together with the vocabulary of independent and dependent variables, leads us to begin our experiments with regression.
Though the odds of successfully modeling snap weight using a linear regression on a function of position, speed, and heading are slim, we can also use these regressions to learn more about the data by seeking patterns and ruling out our own hypotheses. Choosing reasonable functions of the independent variables is a challenge, as there seems to be no literature on predicting snap weights without the underlying GIS (geographic information systems) data, i.e. data for the location of the corresponding road.
Still, our intuition says that significant and sudden deviations in speed, acceleration, or heading are cause to believe a data point has a high margin of error: cars do not suddenly pull an about-face. Similarly, roads are known to be constrained to certain curvatures to reduce banking (inclining to one side of the vehicle due to angular momentum); therefore sudden changes in heading can also be a potential indicator of an erroneous data point. A regression line would capture this through deviation from the smooth line which we expect when modeling speed or bearing. Should results look promising, we can then attempt to use more advanced regression algorithms; least angle regression in particular seems appealing, as it is better at working with multiple potential covariates, as we have here.
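As a rough sketch of how such deviations might be turned into regression inputs, the code below computes the change in speed and heading between consecutive samples and fits a one-covariate least squares line. The function names are hypothetical and the choice of features is our own assumption; the three-second interval comes from the data description, and the unit of speed is whatever the dataset reports.

```python
SAMPLE_INTERVAL_S = 3.0  # sampling interval of the dataset, in seconds


def heading_change_deg(prev_heading, heading):
    """Smallest signed change in heading, in degrees, in the range [-180, 180)."""
    return (heading - prev_heading + 180.0) % 360.0 - 180.0


def deviation_features(speeds, headings):
    """Per-sample (|acceleration|, |turn rate|) from consecutive fixes.

    The first sample has no predecessor, so the result has one fewer
    entry than the input sequences.
    """
    features = []
    for i in range(1, len(speeds)):
        accel = (speeds[i] - speeds[i - 1]) / SAMPLE_INTERVAL_S
        turn_rate = heading_change_deg(headings[i - 1], headings[i]) / SAMPLE_INTERVAL_S
        features.append((abs(accel), abs(turn_rate)))
    return features


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept for one covariate."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # undefined if all xs are equal
    return slope, mean_y - slope * mean_x
```

The wrap-around in heading_change_deg matters because a move from 350 degrees to 10 degrees is a 20 degree turn, not a 340 degree one. Each feature column could then be regressed against the snap weights with fit_line as a first check of our intuition.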
It is also reasonable to assume that, since speed, heading, and even position can be considered time-series (sequential) data, snap weight could be modeled sequentially as well, rather than as i.i.d. In our coursework so far, the only sequential model we have covered is the Hidden Markov Model (HMM). Since we are currently most convinced by the variables speed and heading, each can be used as the observed variable for a separate HMM. In both cases, the hidden variable will of course be the snap weight that we seek.
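To make the HMM setup concrete, here is a minimal sketch of Viterbi decoding for the heading-based model. It assumes that the snap weight has been discretized into two hidden states and the heading change into two observation symbols; both discretizations, all names, and every probability below are made-up placeholders rather than estimates from our data. Since the dataset includes snap weights, the real probabilities could be estimated by counting over the training sequences.

```python
import math

# Hypothetical two-level discretization of the hidden snap weight.
STATES = ("low_error", "high_error")

# Placeholder probabilities for illustration only (not fitted to any data).
START = {"low_error": 0.9, "high_error": 0.1}
TRANS = {
    "low_error": {"low_error": 0.9, "high_error": 0.1},
    "high_error": {"low_error": 0.3, "high_error": 0.7},
}
EMIT = {
    "low_error": {"small_turn": 0.9, "large_turn": 0.1},
    "high_error": {"small_turn": 0.3, "large_turn": 0.7},
}


def viterbi(observations):
    """Most likely hidden state sequence for a non-empty list of observation symbols."""
    # Log-probability of the best path ending in each state at the first step.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]]) for s in STATES}
    back_pointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this step.
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[best_prev] + math.log(TRANS[best_prev][s])
                             + math.log(EMIT[s][obs]))
            pointers[s] = best_prev
        scores = new_scores
        back_pointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

The same decoder would serve the speed-based HMM by swapping in observation symbols derived from changes in speed.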
In an attempt to capture a greater degree of complexity than HMMs can, we will then look into more advanced models. A preliminary search led us to two potentially suitable algorithms: Conditional Random Fields and Neural Networks, the latter of which was briefly mentioned in our first reading. The goal here would be to consider not just the snap weights as sequential data, but other variables as well.
Evaluation
The natural measure of evaluation is the average error percentage between the predicted data and the dataset. We have a long vector of dependent variable data, and the goal is to minimize the average offset of the same column in our predicted values, using either the L1 or L2 norm.
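The two error measures could be computed along the following lines. This is a sketch with hypothetical function names: we take the L1 version to be the mean absolute error and the L2 version to be the root mean squared error, which is one common reading of "average offset" under each norm.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average L1 offset between true and predicted snap weights."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Average L2 offset: square root of the mean squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

The L2 version penalizes a few large misses more heavily than many small ones, so the two measures may rank models differently.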
To evaluate how well we predict the errors, we use a train/validation/test approach. This means that we first hold out some of the data, called the test set, and use the remaining data to fit a model that is able to predict the snap weights. To estimate the parameters of this model, we again split the remaining data into two sets: training and validation. The validation data is used to tune the parameters we find using the training data. At the very end, to evaluate our final model, we apply it to the test data and measure the error relative to the true snap weights.
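A minimal sketch of this three-way split is given below. The 60/20/20 proportions, the fixed seed, and the function name are our own assumptions for illustration.

```python
import random


def split_dataset(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the samples and split them into (train, validation, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test
```

For the sequential models, shuffling individual samples would break up the time series, so in that case the items passed in would more sensibly be whole trips or users rather than single data points.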
Contingency Plan
Should all of these methods fail to produce plausible predictors for the snap weight, it might then be prudent to assume that it may be a function of the geographic location (see clustering in figure 2). To test this, however, we would require more data with a much higher degree of spatial overlap. That is to say, for every arbitrarily assigned region (a set of geographic points) we would require many data points, all from that particular region. Through this we can test whether there are perhaps some intrinsic features of that region which affect the GPS signal quality and therefore the snap value. For example, a particular region might be near a factory whose smog affects the ionosphere, or a dense forest canopy might block the GPS antennae of cars passing through.
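One simple way to assign regions, sketched below, is to bin samples into a regular latitude/longitude grid and compare the mean snap weight per cell. The cell size, the minimum point count, and the function names are hypothetical choices of ours, not part of the dataset.

```python
import math
from collections import defaultdict


def region_key(lat, lon, cell_deg=0.01):
    """Grid cell index for a position, using square cells of cell_deg degrees."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def mean_snap_by_region(samples, cell_deg=0.01, min_points=2):
    """Mean snap weight per grid cell.

    samples is an iterable of (lat, lon, snap_weight) tuples. Cells with
    fewer than min_points samples are dropped, since they do not have
    enough spatial overlap to say anything about the region.
    """
    by_region = defaultdict(list)
    for lat, lon, snap in samples:
        by_region[region_key(lat, lon, cell_deg)].append(snap)
    return {
        key: sum(values) / len(values)
        for key, values in by_region.items()
        if len(values) >= min_points
    }
```

If some cells show consistently higher mean snap weights than their neighbors, that would support the hypothesis that the error depends on location.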
Figure 2: Snap Weight (snapValue) Representation