C01KOONIN-HOLLAND-Need-Tensions-FIRST-DRAFT-20130928 ...



DRAFT




CHAPTER 1

Identifying the Need and the Tensions: Privacy, Security & Transparency of Big Data in the Governmental, Commercial, Academic and Personal Spheres.

Steven E. Koonin, Michael J. Holland

Center for Urban Science & Progress, New York University


The past few decades have seen rapid advances in technologies to acquire, transmit, store, and analyze all manner of data. The “instrumenting of society” as these technologies are widely deployed is allowing for information of unprecedented granularity, variety, and coverage. Cisco, the multinational manufacturer of networking equipment, estimates that by 2015 there will be two networked devices for every person on the globe.[1] This flood of data is increasingly impacting the commercial, academic, personal, and governmental spheres.

A decade ago, it was news that Wal-Mart was using predictive analytics to anticipate inventory needs in the face of upcoming severe weather events.[2] Today, retail (inventory management, online recommendation engines), advertising, insurance (improved stratification of risk), finance (investment strategy, fraud detection), real estate, entertainment, and political campaigns have all been moving to acquire, aggregate, and analyze large amounts of societal data to improve their performance.

Basic research is also seeing the rise of “big data” technologies. Large federated databases are now an important asset in astronomy, the earth sciences, and biology. The social sciences are beginning to grapple with the implications of this transformation.[3] The traditional data paradigm of social science relies upon well-designed surveys and experiments, both qualitative and quantitative, as well as exploitation of administrative records created for non-research purposes. This generates clean data from comparatively small samples. This methodology can now be complemented by large volumes of imperfect data. If sampling errors, coverage errors, and biases can be accounted for, we believe the combination can yield new insights into human behavior and social norms. At the Center for Urban Science & Progress, our goal is data that can allow us to quantify the “pulse of the city.” A new science of cities is beginning to emerge, with an understanding of how scaling laws and scientific simulations can apply to transportation systems, energy use, economic activity, innovation, and a whole host of other urban activities.[4],[5],[6]


Governments are exploring whether making their data more open can help them become more participatory, decentralized, and agile institutions able to solve problems faster and more successfully on behalf of their citizens while increasing the legitimacy of democratic governance. State and local governments seek to deliver services more efficiently, to set better policies, and to better plan infrastructure improvements. The federal government is interested for many of the same reasons, but also to fulfill its obligation to produce accurate national statistics. Within the U.S., local and national law enforcement and the Department of Homeland Security strive to understand what’s going on in society, as does the intelligence community abroad. While these organizations are largely interested in identifying individual bad actors rather than broad behavioral trends, the technologies and methodologies are common to other uses. And citizens are interested in urban data to ensure government transparency and accountability as well as to enhance their local government’s opportunities to improve urban living.[7] In the realm of municipal governance, “big data” can take us beyond today’s imperfect and often anecdotal understanding of cities to enable better operations, better planning, and better policy. Putting urban data in the hands of citizens can improve governance and participation; in the hands of entrepreneurs and corporations it will lead to new products and services for governments, firms, and consumers. In short, it is now not a fantasy to ask “if you could know anything about a city, what do you want to know” and to ponder what could be done with that information.

Yet it is this increasing temporal and spatial granularity of data about individuals, the extent of data collections in the hands of commercial or governmental organizations with interests not necessarily aligned with those of individuals, and the increasing power of informatics tools to combine and mine streams of data that stoke concerns about privacy and data access. Further development of the technical tools and administrative controls required to assure privacy and data security is an extraordinarily important precursor to the deeper scientific study of cities.

An Urban Data Taxonomy

To think about the “know anything” question, we start from the observation that cities deliver services (shelter, safety, security, health, food, water, waste, energy, mobility, etc.) to their citizens through infrastructure and through processes. We want to know how those systems operate, how they interact, and how they can be optimized. There are three classes of urban systems about which data needs to be acquired:



The Infrastructure: Major questions about urban infrastructure focus on its extent, condition, and performance under varying scenarios of use. We need to know the condition of the built infrastructure: Are the bridge joints corroding? Can we find the leaky pipes? Which pavement resists excessive wear from heavy vehicles? We need to understand the operation of the infrastructure: How is traffic flowing? Is the electrical grid balanced?




The Environment: Major questions about the urban environment focus on the sources and fates of pollutants, the health burdens those pollutants place on vulnerable subpopulations, and the vitality of natural systems facing demands for environmental services. We need to understand whether a city’s river can support recreational uses such as fishing and rowing while simultaneously allowing for nearby industrial uses. In addition to the usual meteorological and pollution variables of interest, we need to understand the full range of environmental factors, such as noise, that influence people’s experience of the city day to day.



The People: Major questions about urban populations focus on the interactions of people with each other and with institutions, their interactions as organizations, and their interactions with the built and natural environments. Cities are built by and for people and so cannot be understood without studying the people: their movement, health status, economic activities, how they communicate, their opinions, etc.


Urban data sources. From a purely scientific perspective, urban data naturally organizes itself into three broad categories according to how the data is generated (traditional text and numerical data, in situ sensor data, and synoptic sensor data), yet those categories are less relevant to the concerns associated with privacy. With respect to privacy concerns, it matters greatly who generates, collects, or contributes the data, whether government, private sector institutions, or individuals. We will discuss these differences in greater detail later.

The first category of urban data is the traditional text and numerical records that agencies and commercial entities generate in their routine course of business. These administrative and transactional data sources are the familiar records, such as permits, tax records, public health and land use data, or sales, inventory and customer records, that social scientists have been exploiting for decades, if not centuries, along with survey tools. Potential internet data sources include Twitter feeds, social media, blogs, and news articles. Text and numerical records can be aggregated at the city level (census, statistical bureaus), at the firm or neighborhood level (census blocks, tracts, neighborhoods), or at the individual level (retail sales records, surveys). With the migration of commerce, government, and individual activities to the digital sphere, the available volume of such data is growing exponentially.

Next, in situ scanned and sensor data is the most rapidly growing category of data relevant to the interests of urban science. Enabled by increasingly cheap microprocessor power and communications, particularly wireless connections, engineers are rapidly developing methods to instrument infrastructure and the environment or extract people’s movement from commonly used personal electronic devices.[8] The expanding “internet of things” enabled by the ease of scanning barcodes or QR codes and the plummeting price of RFID tags will only accelerate the stream of data related to object identity, location and time of last movement. Participatory (crowd-sourced) sensing of the environment or infrastructure utilization via mobile phones is also feasible.[9],[10] Although operational data such as traffic and transit flows, utility supply and consumption, economic, and communications records also exist, such operational data streams may be difficult to access and aggregate for proprietary reasons, privacy reasons, or both.

Beyond fixed in situ sensors to record light, temperature, pollution, etc., personal sensors that record location, activity, and physiology are becoming available. While personal activity monitors such as Fitbit are becoming popular among athletes and the quantified-self communities, applications such as assistive health care for the elderly infirm raise particular privacy concerns.[11]


Finally, cameras and other synoptic sensors are a rich new area for data relevant to urban science. There is an on-going proliferation of video cameras at points of commerce and automatic teller machines or at portals for pedestrians and vehicles. Despite an estimated 30 million cameras in public spaces in the US, very little of the video collected is analyzed, other than as needed for forensics. Traffic scene surveillance for congestion or license plate monitoring is a major exception. Rapid automated analysis of camera feeds is computationally challenging, but computer vision enabled by unsupervised machine learning is beginning to open up new opportunities.[12] Platforms, such as YouTube, Pinterest or Flickr, for individuals to post images and video are proliferating. Sophisticated image processing tools, now becoming available as web apps, are able to construct 3D geometries from large, unorganized collections of photographs.[13]

Remote sensing also offers new possibilities for urban science. While transient remote sensing from satellites or aircraft is well-known, persistent remote sensing from urban vantage points is an intriguing possibility. Instrumentation on a tall building in an urban center can “see,” modulo shadowing, tens or hundreds of thousands of buildings within a 10 km radius, without the mass, volume, power, or data rate constraints of airborne platforms. As an example, differing sampling rates in the visible spectrum allow for the exploration of different phenomena. At low sampling rates, we can watch new lighting technologies penetrate a city and correlate what is known about early adopters or lagging adopters from municipal permitting databases to tease out the behavioral and financial components of energy efficient lighting technology diffusion. At very high sampling rates, transients observable in the lights may provide a measure of other plug loads which would otherwise only be accessible with expensive submetering. Moderate sampling rates can reveal behavioral information.

Visible, infrared, hyperspectral, and radar imagery are all phenomenologies to be explored for urban scenes, as is Light Detection and Ranging (LIDAR). The synoptic and persistent coverage of such modalities, together with their relatively easy and low-cost operation, may offer a useful complement to in situ sensing. Clearly this monitoring from a public vantage point raises issues of data collection, which Strandburg addresses in this volume.

How will the data be used?

Large urban datasets will be used in several different ways. One of the simplest is identification of unusual data or outliers. The distribution of observations of any given variable over a population may reasonably be expected to be unimodal, although not necessarily normal. Large statistics, and control of systematic trends, allow for clear identification of outliers in such a distribution, which can then be investigated in more detail.

An example is the energy use data from large buildings in NYC (figure below).[14] The weather-normalized energy use intensity (annual kBtu/sq ft) of multifamily residential units is nominally normal, while that for office buildings shows a “fat tail” on the high side, with the most inefficient buildings consuming energy at more than 15 times the rate of the most efficient. Investigation of the causes of such differences is clearly of interest (Data errors? Differences in occupancy? In activity? In construction? In building operation?).
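As a rough illustration of how outliers in a skewed, fat-tailed distribution might be flagged, the sketch below uses a median-based (modified z-score) rule rather than a mean and standard deviation, since the latter are themselves distorted by the tail. The function name, the 3.5 cutoff, and the energy use intensity values are all hypothetical and are not taken from the NYC benchmarking data.

```python
import statistics


def robust_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold.

    The modified z-score replaces the mean and standard deviation with the
    median and the median absolute deviation (MAD), so a few extreme
    observations do not mask themselves.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median; nothing can be ranked
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical annual energy use intensities (kBtu/sq ft) for twelve buildings.
eui = [60, 72, 75, 80, 81, 85, 88, 90, 95, 102, 110, 400]
flagged = robust_outliers(eui)
print("Buildings to investigate:", flagged)
```

A flagged building is only a candidate for follow-up; as the questions above suggest, the cause may be a data error as easily as a real difference in operation.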

Large datasets will also be used to corroborate and evaluate simulations. As discussed below, an important tool and product of urban informatics will be high-resolution agent-based simulations integrating mobility, land use, energy, health, economics, communications, etc. These datasets will be essential to constructing, validating, and improving such simulations. It remains to be investigated what observations these simulations need to reproduce with what fidelity for a given purpose.

In addition to data linkage, correlation analyses will be useful in constructing and validating behavioral proxies. For example, demonstrating that infrared images are well-correlated with building energy consumption in a small subset of buildings for which the latter are known directly (e.g., through utility records) would allow energy consumption to be estimated for a much broader set of buildings through synoptic IR imaging. At a more technical level, a dataset that lists family income by address could be combined with visual synoptic observation of lighting by address to infer energy use as a function of family income.
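The proxy idea can be sketched as a two-step calculation: fit a relationship on the small subset where both quantities are known, check how strong the correlation is, and only then apply the fit to buildings observed by imaging alone. The code below is a minimal sketch assuming a simple linear relationship; the function names, the infrared index, and every number are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit y = slope * x + intercept.

    Returns (slope, intercept, r), where r is the Pearson correlation
    between x and y on the calibration subset.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


def estimate(proxy_values, slope, intercept):
    """Apply the calibrated line to proxy readings with no direct measurement."""
    return [slope * v + intercept for v in proxy_values]


# Hypothetical calibration subset: an infrared brightness index and metered
# energy use (e.g., from utility records) for the same five buildings.
ir_index = [1.0, 2.1, 2.9, 4.2, 5.0]
metered_energy = [13.0, 22.5, 31.0, 44.0, 51.5]

slope, intercept, r = fit_line(ir_index, metered_energy)

# Buildings seen only in the imagery.
ir_only = [1.5, 3.3, 4.8]
estimated_energy = estimate(ir_only, slope, intercept)
print(f"correlation on calibration subset: r = {r:.3f}")
print("estimated energy:", [round(e, 1) for e in estimated_energy])
```

Whether the calibration subset is representative of the broader building stock is a separate question that the correlation coefficient alone does not answer.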

Urban data challenges

Despite the promise of urban datasets, turning the deluge of data into useful information and understanding faces a number of challenges.


Disparate formats. As noted above, much of the value of large datasets lies in their correlation (the ability to combine two or more datasets to infer new properties), but that opportunity is not costless. The urban data sources we are interested in are extraordinarily heterogeneous in their character (text, video, audio, mobility tracks, and instrumental readings). Their structure, coverage, and quality make scalable analysis a challenging task. Classic database challenges such as poor naming standards,[15] lack of documentation, or the absence of appropriate protections by the data owner for data integrity are just a few of the hurdles that can dramatically limit the utility of existing datasets. For data collected for multiple purposes from different organizations, data provenance will be a significant consideration.[16] Provenance information needs to be recorded so that basic questions can be answered, such as: Who created this data product and when? When was it modified and by whom? What was the process used to create the data product? Were two data products derived from the same raw data? For non-public data obtained by CUSP, we will have to add information about the terms of use in data transfer agreements, who accessed data, when and for what purpose.
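One way to picture the provenance questions listed above is as fields on a record that travels with each data product. The sketch below is purely illustrative: the class, field, and dataset names are hypothetical, and it does not describe any system actually used by CUSP.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance metadata attached to one data product."""
    product_id: str
    created_by: str                   # who created the data product
    created_at: datetime              # and when
    process: str                      # how the product was created
    raw_sources: Tuple[str, ...]      # identifiers of the raw data it derives from
    terms_of_use: str = ""            # e.g., a pointer to a data transfer agreement
    modifications: List[Tuple[datetime, str]] = field(default_factory=list)
    accesses: List[Tuple[datetime, str, str]] = field(default_factory=list)

    def record_modification(self, who: str, when: datetime) -> None:
        """Log when the product was modified and by whom."""
        self.modifications.append((when, who))

    def record_access(self, who: str, purpose: str, when: datetime) -> None:
        """Log who accessed the product, when, and for what purpose."""
        self.accesses.append((when, who, purpose))


def share_raw_source(a: ProvenanceRecord, b: ProvenanceRecord) -> bool:
    """Were two data products derived from (some of) the same raw data?"""
    return bool(set(a.raw_sources) & set(b.raw_sources))


stamp = datetime(2013, 9, 28)
by_building = ProvenanceRecord(
    "eui_by_building", "analyst_a", stamp, "weather normalization",
    ("utility_records",), terms_of_use="agreement on file",
)
by_tract = ProvenanceRecord(
    "eui_by_tract", "analyst_b", stamp, "spatial aggregation",
    ("utility_records", "tract_boundaries"),
)
by_building.record_access("analyst_b", "validation", stamp)
print(share_raw_source(by_building, by_tract))
```

Keeping the access log on the same record as the terms of use makes it straightforward to audit whether each recorded purpose was permitted under the governing agreement.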

Data Cadence. Aside from their origins, traditional microdata resulting from censuses, sample surveys, administrative records, and statistical modeling differ from big data in several important ways, as noted by Capps and Wright.[17] Much of the usual microdata encompass records numbering in the hundreds of millions, while big datasets are many orders of magnitude greater. The computational challenges associated with massive data management are substantially different from those for static datasets in terms of scale and throughput. Technical advances are required to scale data infrastructure for curation, analytics, visualization, machine learning, data mining, as well as modeling and simulation to keep up with the volume and speed of data.[18]


Official statistics and datasets tend towards periodic cycles of input, analysis and release (a corporation’s quarterly earnings report or the Bureau of Labor Statistics’ Employment Situation Summary on the first Friday of every month), while much of the data we would like to access for urban science flows continuously. Many government agencies or corporations would like to analyze that data in real time for operational reasons. Traditional microdata, including surveys, tend to be labor intensive, subject to human error and costly in their collection, while big data are often born digital and seem relatively cheap by comparison. Surveys, which form the foundation of official statistics, are the result of careful data collections designed with clearly defined uses, while big data come with unknowns (e.g., uses are less clear, data are less understood, data are of unknown quality, and representativeness is largely unknown). Capps and Wright also note that with respect to surveys, response assumes permission to use. Big data, on the other hand, come as byproducts of other primary activities and without asking explicitly.

Organizational disincentives. To be correlated, data must be brought together. Data may not be simply difficult to analyze or, in the case of petabyte and larger datasets, hard to move; data can also be difficult to obtain. Organizational barriers and the incentive structures of people within those organizations can greatly complicate the task of obtaining data. In the commercial sector, proprietary data is valuable and even simple information asymmetries with respect to a firm’s real costs can be profitable. In academia, the generation of or access to unique data is often the researcher’s edge in competition with peers, is a currency invested in cementing relationships with valued collaborators, or, as Stodden notes,[19] is withheld in the interest of commercializing inventions derived from university research.

In the government sector, data held within a bureau or office are frequently a source of power and influence that helps define the limits of organizational turf. Permissible release of data to individuals outside an agency is often, but not always, perceived to carry a risk for the agency (e.g., potential embarrassment for poor performance or evidence of disparities in regulatory enforcement) with little identifiable benefit to the agency itself. Such reticence only escalates within agencies when the data held relates to a politically charged issue. We should note that susceptibility to external social or political pressure is not unique to the government sector nor is it uniform. In the commercial sector, retail firms are more likely to be sensitive than data aggregators, search firms or social media companies, whose paying customers are other firms (often retail advertisers), not individuals.


The more routine explanation for barriers to information release by government agencies is that, in the non-market setting of government, demands on agencies always outpace available resources. Where agencies cannot charge for their services, they must develop non-monetary costs to impose on their clients as a means of rationing their outputs, including requests for information.[20] And so, their willingness to share data, even data intended to be public, varies widely. Making data easily available can fall victim to less nefarious causes such as overtaxed staff, a failure of imagination within the agency that anyone would want all the available data, or aesthetic considerations in design of a webpage. As one small example, federal budget data is supposed to be one of the most readily available types of data. One can easily find all relevant Congressional documents for the appropriations process in an electronic format (pdf) on the Library of Congress’ Thomas website back to 1998,[21] yet the Department of Energy makes only the most recent 10 years of its Congressional Budget Justifications available on its CFO’s website.[22] The Department has the documents in electronic format back to 1977, the year the Department was established, and could very easily post them, as the US EPA does back to 1967.[23]

Relationship of Big Data to Open Government

[HAVE SOME HOMEWORK TO DO ON THIS. IT WILL BE A SHORT, <1/2 PAGE, SECTION.] [24]

Building upon a nearly half-century-long history of the Freedom of Information Act, Sunshine Acts, and Right-to-Know laws, open government advocates need data scientists to help them sort through and make sense of the vast troves of public data.

With respect to the data itself, many NYC datasets are posted on the City’s open data website, https://nycopendata.socrata.com/. However, the roughly 1,000 datasets listed show great variability in their data quality, currency, and completeness.


Tensions

In closing, a few brief remarks about the tensions inherent in the analysis of massive datasets.

Transparency versus Privacy. The value of any large urban dataset is enhanced through its association with other data. Observations are linked through location and time, as well as through entity (person, firm, vehicle, structure). The power of such linkage in producing new information is significant. For example, knowing an individual’s ZIP code localizes that person to 1 in 30,000 (the average population of a ZIP code).[25] Linking a ZIP code with a birthdate (month and day) reduces the pool to approximately one in 80, while further connecting gender and year-of-birth is sufficient, on average, to uniquely specify an individual.
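The arithmetic behind this example can be reproduced in a few lines. The sketch below assumes, purely for illustration, that each attribute is uniformly distributed and independent of the others, which real populations are not; the function name and the figure of 80 plausible birth years are assumptions rather than values from the cited source.

```python
def expected_pool(population, *attribute_cardinalities):
    """Expected number of people sharing one combination of attribute values.

    Assumes every attribute is uniformly distributed and independent,
    so each attribute with k possible values shrinks the pool by a factor k.
    """
    pool = float(population)
    for k in attribute_cardinalities:
        pool /= k
    return pool


ZIP_POPULATION = 30_000   # average population of a ZIP code, per the text
BIRTHDAYS = 365           # month-and-day combinations
GENDERS = 2
BIRTH_YEARS = 80          # assumed span of birth years in the population

after_birthday = expected_pool(ZIP_POPULATION, BIRTHDAYS)
after_all = expected_pool(ZIP_POPULATION, BIRTHDAYS, GENDERS, BIRTH_YEARS)

print(f"ZIP + birthday:                {after_birthday:.1f} people")   # about 82
print(f"ZIP + birthday + gender + year: {after_all:.2f} people")       # about 0.51
```

An expected pool below one person is what "uniquely specify, on average" means here: most combinations of these four attributes that occur in a ZIP code belong to a single resident.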


There is a widespread assumption, abetted by examples like the one above, that information release (sharing, flow) is synonymous with loss of privacy. We should recognize, however, that not all data is equally intrusive and not all analyses are likely to be privacy violating. In cases where data relates people to location or people to their social network, the risk to privacy is likely greater than in cases where the data is solely about individuals, absent any locational or relational information. If privacy is to be understood as a value worth protecting it cannot simply mean secrecy, i.e. the withholding of information. Nissenbaum’s work has argued that it is the inappropriate sharing (flow, release, etc.) of information, not simply the sharing, that violates privacy.[26] Understood this way, there is no inherent conflict between data utility and data privacy. There is only conflict between particular uses, analyses or disseminations and privacy.





Periodically, suggestions arise that the solution to privacy concerns is increased transparency (a “put it all out there, we’ve got nothing to hide” ethos), yet that can lead to data security problems. Far more careful analysis and improved techniques for assessing that balance need to be pursued. Methods can be further developed for estimating re-identification risks for particular settings,[27] so that data scientists, when discussing the risks to privacy, could use a likelihood language akin to what the IPCC uses to describe the probability of a given outcome (“virtually certain,” “likely,” “extremely unlikely”).[28] Research agencies and foundations supporting data science would do well to examine the precedent of the ethical, legal and social implications (ELSI) program in genomics.[29] The tensions among transparency in the organizations holding and analyzing data, data security, and the privacy interests of those whose data is being used are appropriate topics for such a program.
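To make the idea of a likelihood language concrete, the sketch below maps an estimated re-identification probability to a calibrated phrase. The thresholds are assumptions modeled on the likelihood scale of the IPCC Fourth Assessment Report and should be checked against the cited report before being relied upon; the function name and the handling of boundary values are illustrative choices.

```python
# Assumed lower bounds (exclusive) for each phrase, highest first.
# Modeled on the IPCC AR4 likelihood scale; verify against the source.
LIKELIHOOD_SCALE = [
    (0.99, "virtually certain"),
    (0.95, "extremely likely"),
    (0.90, "very likely"),
    (0.66, "likely"),
    (0.33, "about as likely as not"),
    (0.10, "unlikely"),
    (0.05, "very unlikely"),
    (0.01, "extremely unlikely"),
]


def likelihood_term(probability):
    """Translate a probability in [0, 1] into a calibrated verbal phrase."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    for lower_bound, term in LIKELIHOOD_SCALE:
        if probability > lower_bound:
            return term
    return "exceptionally unlikely"


# Hypothetical re-identification risk estimates for three release scenarios.
for risk in (0.995, 0.70, 0.03):
    print(f"{risk:.3f} -> re-identification is {likelihood_term(risk)}")
```

A shared vocabulary of this kind matters less for its exact cutoffs than for forcing an analyst to state a quantitative risk estimate before choosing a word.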

Asymmetry in Costs & Benefits. In an urban setting, particularly when analyses are relevant to the operations of agencies or the development and assessment of government policy, benefits and costs are going to be distributed unequally. The most obvious asymmetry is in the value of data. The monetary value of individuals’ data is greater once included in a large dataset, a value enhanced through the ability to correlate with other data, than uncorrelated data is to any individual alone.

Asymmetry in costs and benefits can manifest in subtle ways. With respect to the “we’ve got nothing to hide” ethos mentioned above, it could be that the data subjects do have something they want to keep confidential or even hide. In any case, having their data exposed as a mechanism for deflecting criticism of the data scientist is not an equitable trade.


Power Dynamics. It is frequently believed that concentrated benefits or costs are far more motivating in the political arena than are diffuse interests, even if the sum total of the diffuse benefits or costs outweighs the concentrated sum, yet power still matters. Knowing how power is exercised in government and having access to those wielding it matters greatly, and quantification brings an often unexamined power and prestige to public policy debates.[30] Quantitative analysis can give the data analyst greater standing or authority in a debate than the tacit knowledge of a blue collar worker or a community member. Caution about the interpretive power of data models is crucial, given the real potential for harm in some cases.[31]


Conclusions

We in the data science community who are interested in accessing public data with the goal of improving our scientific understanding of how cities operate and how they can operate for the greater benefit of the citizens also need to demonstrate that making public data open can benefit the agencies and the civil servants within them. Our goal is not just relevant research but impact, which means we need to approach this goal with a degree of humility. Data is not equivalent to information, and information, when injected into management, policy and political spheres, is far from determinative of the outcome. As Downs notes, top-level officials in government or any large organization tend to become involved only in the most difficult situations. The challenge for agency decision makers lies not simply in procuring additional information but in assessing its significance in terms of future events, events around which there will always be some uncertainty.[32]


Moreover, agencies are making those decisions within competing philosophies of change. When faced with politicians and citizens whose outlook is trusting towards government, agencies are encouraged to incorporate more information into decision making as an expression of greater scientific management or towards greater experimentation in approaches for meeting their mission. When faced with politicians and citizens whose outlook is distrustful of government, agencies face aggressive oversight and pressures to root out waste.[33] And so, urban scientists and data scientists interested in big data need to continually be aware of the context from which data comes, the context in which analyses are used to make decisions, and the context within which privacy concerns are balanced.


ENDNOTES




[1] Insert Cisco reference.

[2] Constance L. Hays, “What Wal-Mart Knows About Customers' Habits,” New York Times, November 14, 2004.

[3] King, G. (2011) “Ensuring the data-rich future of the social sciences,” Science, 331(6018), 719-721.

[4] M. Batty, K.W. Axhausen, F. Giannotti, A. Pozdnoukhov, A. Bazzani, M. Wachowicz, G. Ouzounis, and Y. Portugali, “Smart cities of the future,” Eur. Phys. J. Special Topics 214, 481-518 (2012); DOI: 10.1140/epjst/e2012-01703-3

[5] Luís M. A. Bettencourt, José Lobo, Dirk Helbing, Christian Kühnert, and Geoffrey B. West, “Growth, innovation, scaling, and the pace of life in cities,” PNAS 104 (17) 7301-7306; 2007.

[6] Bettencourt, L., Lobo, J., & Strumsky, D. (2007). “Invention in the City: Increasing Returns to Patenting as a Scaling Function of Metropolitan Size,” Research Policy, 36, 107-120.

[7] Insert ref for Datakind, hacks, Code for America.

[8] Marta C. Gonzalez, Cesar A. Hidalgo, Albert-Laszlo Barabasi, “Understanding individual human mobility patterns,” Nature, v453, n5, pp. 779-782, June 2008; doi:10.1038/nature06958.

[9] Nicolas Maisonneuve, Matthias Stevens, and Bartek Ochab, “Participatory noise pollution monitoring using mobile phones,” Information Polity 15 (2010) 51-71; DOI 10.3233/IP-2010-0200

[10] Wang, P., Hunter, T., Bayen, A.M., Schechtner, K. & Gonzalez, M.C., “Understanding Road Usage Patterns in Urban Areas,” Nature, Sci. Rep. 2, 1001; DOI:10.1038/srep01001 (2012).

[11] T. Giannetsos, T. Dimitriou and N. R. Prasad, “People-centric Sensing in Assistive Healthcare: Privacy Challenges and Directions,” Security Comm. Networks 2011; 4:1295-1307; DOI: 10.1002/sec.313

[12] Anthes, G., “Deep Learning Comes of Age,” Communications of the ACM. Jun 2013, Vol. 56 Issue 6, p13-15; DOI:10.1145/2461256.2461262

[13] Microsoft Photosynth. Available at http://photosynth.net/about.aspx (accessed September 18, 2013). For a technical description of the method, see Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M. Seitz, and Richard Szeliski, “Building Rome in a Day,” Communications of the ACM. Oct 2011, Vol. 54 Issue 10, p105-112; DOI: 10.1145/2001269.2001293

[14] New York City Mayor’s Office of Long-Term Planning and Sustainability, New York City Local Law 84 Benchmarking Report, 2013. Available at http://nytelecom.vo.llnwd.net/o15/agencies/planyc2030/pdf/ll84_year_two_report.pdf (accessed September 28, 2013).

[15] NYC government uses at least three distinct identifiers for buildings, depending upon the agency and use: a borough-block-lot number assigned by Department of Finance, a unique Building Identification Number (BIN) assigned by City Planning, and the street address of residence or commercial property used by most agencies.

[16] Susan B Davidson, Juliana Freire, Provenance and scientific workflows: challenges and opportunities, Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1345-1350 (2008).

[17] Capps, C., Wright, T., “Toward a Vision: Official Statistics and Big Data,” Amstat News, 1 August 2013. Available at http://magazine.amstat.org/blog/2013/08/01/official-statistics/ (accessed September 19, 2013).












[18] National Research Council. 2013. Frontiers in Massive Data Analysis. Washington, D.C.: The National Academies Press.

[19] V. Stodden and I. Reich, “Software Patents as a Barrier to Scientific Transparency: An Unexpected Consequence of Bayh-Dole,” Conference on Empirical Legal Studies, 2012. Available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2149717 (accessed September 29, 2013).

[20] Anthony Downs, Inside Bureaucracy, Boston, MA: Little, Brown & Co., p. 188 (1967).

[21] The Library of Congress Thomas, Status of Appropriations Legislation for Fiscal Year 2014. Available at http://thomas.loc.gov/home/approp/app14.html (accessed September 28, 2013).

[22] Energy.Gov, Office of the Chief Financial Officer, U.S. Department of Energy, Budget (Justification & Supporting Documents). Available at http://energy.gov/cfo/reports/budget-justification-supporting-documents (accessed September 28, 2013).

[23] U.S. Environmental Protection Agency, Historical Planning, Budget, and Results Reports. Available at http://www2.epa.gov/planandbudget/archive#Justification (accessed September 29, 2013). Since EPA was established in 1970, this includes 3 years of budget data from predecessor agencies.

[24] http://www.opengovpartnership.org/open-government-declaration, http://www.fas.org/sgp/crs/secrecy/R41933.pdf, http://www.yalelawtech.org/privacy-who-can-you-trust/government-data-balancing-transparency-and-privacy/

[25] Sweeney, Latanya. "Foundations of privacy protection from a computer science perspective." (2011). Available at http://dataprivacylab.org/projects/disclosurecontrol/paper1.pdf (accessed September 28, 2013).

[26] Helen Nissenbaum, Privacy in Context: Technology, Policy, and the Integrity of Social Life, Stanford University Press, 2010.

[27] Dankar, Fida Kamal; El Emam, Khaled; Neisa, Angelica; Roffey, Tyson. “Estimating the re-identification risk of clinical data sets,” BMC Medical Informatics & Decision Making. 2012, Vol. 12 Issue 1, p66-80. 15p. DOI: 10.1186/1472-6947-12-66.

[28] IPCC, 2007: Climate Change 2007: Synthesis Report. Contribution of Working Groups I, II and III to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change [Core Writing Team, Pachauri, R.K and Reisinger, A. (eds.)]. IPCC, Geneva, Switzerland; Appendix II, p. 83.

[29] National Human Genome Research Institute, ELSI Planning and Evaluation History. Available at http://www.genome.gov/10001754 (accessed September 29, 2013).

[30] Theodore M. Porter, Trust in Numbers: The Pursuit of Objectivity in Science and Public Life, Princeton University Press (1996).

[31] Joe Flood. The Fires: How A Computer Formula, Big Ideas, and The Best of Intentions Burned Down New York City and Determined the Future of Cities. New York: Riverhead Books, 2010. In the late 1960s, New York City Mayor John Lindsay hired consultants from the RAND Corporation to help modernize municipal service delivery and achieve budget savings. RAND recommended an overhaul of fire station locations and the number of engines responding to fires, based on flawed firefighter response time data. When fire broke out in the Bronx, firefighters were unable to respond in time, and fires ended up burning out of control.

[32] Downs, op cit., p 190.

[33] Paul C. Light, A Government Ill Executed, Cambridge, MA: Harvard University Press, pp.164-166 (2008).