Identifying the Need and the Tensions: Privacy, Security & Transparency of
Big Data in the Governmental, Commercial, Academic and Personal Spheres.
Steven E. Koonin
Center for Urban Science & Progress, New York University
The past few decades have seen rapid advances in technologies
, transmit, store, and
analyze all manner of data
society” as these technologies are widely
information of unprecedented granularity, variety, and coverage. Ci
, the multinational
manufacturer of networking equipment,
will be two
every person on the globe.
This flood of data
personal, and governmental spheres
A decade ago,
analytics to anticipate inventory needs in the face of upcoming severe weather events.
(inventory management, online recommendation engines)
, advertising, insurance (
risk), finance (investment
), real estate, entertainment, and
political campaigns have all been moving to acquire, aggregate, and analyze large amounts of societal
data to improve their performance.
Basic research is also seeing the rise of “big data” technologies. Large federated databases are
now an important asset in astronomy, the earth sciences, and
The social sciences are beginning to
grapple with the implications of this transformation
data paradigm of social
designed surveys and experiments, both qualitative and quantitative
, as well as
exploitation of administrative records created for non
from comparatively small samples
. This methodology
complemented by large volumes of
f sampling errors, coverage errors, and
biases can be accounted for
e believe the
can yield new insights into
human behavior and social norms
the Center for Urban
Science & Progress, our
goal is data to can allow us to
the “pulse of the city
A new science of
is beginning to emerge, with an understanding of how scaling laws and scientific simulations can
apply to transportation systems, energy use, economic activity, and innovation and a whole host of
other urban activities.
whether making their data more open can help them b
ralized, and agile institutions able
solve problems faster and mo
on behalf of their citizens
the legitimacy of
te and local
governments seek to deliver services more efficiently, to set better policies, to better plan infrastructure
improvements. The federal government is interested for many of the same reasons, but also to fulfill its
obligation to produce accurate national statistics.
Within the U.S., local and national law enforcement
and the Department of Homeland Security strive to understand what’s going on in society, as does the
intelligence community abroad. While these organizations are largely interested in identifying individual
bad actors rather than broad behavioral trends, the technologies and methodologies are common to
citizens are interested in urban data to ensure government
as well as
government’s opportunities to improve urban living.
the realm of municipal governance,
“big data” can take us beyond today’s imperfect and often
anecdotal understanding of cities to enable better operations, better planning, and better
Putting urban data in the hands of citizens
improve governance and participation; in the hands of
entrepreneurs and corporations it will lead to new products and services for governments, firms, and
consumers. In short, it is now not a fantasy to ask “if you could know anything about a city, what do you
want to know” and to ponder what could be done with that information.
t is this increasing temporal and spatial granularity of data about individuals, the extent of data
collections in the hands of commercial or governmental organizations with interests
those of indi
, and the increasing power of informatics tools to combine and mine
concerns about privacy and data access
Further development of
technical tools and administrative controls
to assure privacy and data securi
think about the
security, health, food, water, waste, energy, mobility,
to their citizens through
want to know how those systems operate
, how they
and how they can
There are three
urban systems about which data needs
to be acquired:
Major questions about urban infrastructure focus on its extent, condition,
and performance under varying scenarios of use.
e need to
condition of the
Can we find the leaky pipes? Which pavement
resists excessive wear from heavy vehicles
We need to
Major questions about the urban environment focus on the sources and fates
of pollutants, the health burdens those pollutants place on vulnerable
subpopulations, and the
of natural systems
facing demands for
ed to understand
whether a city’s river can support recreational uses such as fishing and rowing when
simultaneously allowing for nearby industrial uses. In addition to the usual
of interest, we need to understand the full range of environmental factors
of the city
day to day
Major questions about urban populations focus on the interactions of people with
, with institutions, and their
interactions as organizations
as well as their interactions
with the built and natural environments.
ities are built by and for people and
so cannot be
without studying the people
how they c
Urban data sources
From a purely scientific perspective, urban data naturally organizes itself into
these three broad categories according to how the data is generated
traditional text and numerical
data, and synoptic
, yet those categories are less relevant to the concerns
associated with privacy.
With respect to
the data is generated, collected, or
whether by government, private sector institutions or by individuals. We
will discuss these differences in greater detail later.
The first category
of urban data
is the t
text and numerical
te in the
routine course of
. These administrative and transactional
data sources are the familiar records
permits, tax records, public health
sales, inventory and customer records
that social scientists have been exploiting for decades, if not
, along with survey tools
Potential internet data sources include Twitter feeds, social media,
blogs, and news articles.
Text and numerical
can be aggregated
at the city level
), at the
neighborhood level (
census blocks, tracts, neighborhoods
), or at the
retail sales records,
With the migration of commerce, government, and
individual activities to the digital sphere, the available volume of such data is growing exponentially.
category of data relevant to
urban science. Enabled by increasingly cheaper microprocessor power and communications,
particularly wireless connections, engineers are
rapidly developing methods
from commonly used personal electronic devices
The expanding “internet of things” enabled by the ease of scanning barcodes or QR codes and the
plummeting price of RFID tags will only accelerate the stream of data related to object identity, location
and time of last movement.
sensing of the environment
phones is also feasible.
operational data such as traffic and transit
flows, utility supply and consumption, economic, and communications records also exist,
operational data streams
may be difficult to access and aggregate for proprietary
sensors to record light, temperature, pollution, etc., personal sensors that
record location, activity, and physiology are becoming available.
While personal activity monitors such
as Fitbit are becoming popular among athletes and the quantified-self communities, applications such as
assistive health care
for the elderly infirm
raise particular privacy concerns.
Finally, cameras and other synoptic sensors are
a rich new area for data relevant to urban
here is a
proliferation of video cameras at
points of commerce and automatic teller
portals for pedestrians and
vehicles. Despite an estimated 30 million cameras in public
spaces in the US, very
the video colle
for congestion or
is a major exception
analysis of camera feeds is computationally challenging, but computer vision enabled by unsupervised
machine learning is beginning to
open up new opportunities.
Platforms for individuals to post images and video, such as YouTube, Pinterest, or
Flickr, are proliferating. Sophisticated image processing tools
now becoming available as web apps
construct 3D geometr
from large, unorganized
collections of photographs
Remote sensing also offers new possibilities
for urban science
. While transient remote sensing from
satellites or aircraft is well
known, persistent remote sensing
from urban vantage points is an intriguing
possibility. Instrumentation on a tall building in an urban center can “see
hundreds of thousands of buildings
within a 10 km radius, without the mass, volume, power, or data
As an example, differing sampling rates
in the visible spectrum
allow for the exploration of different phenomena. At low sampling rates, we can watch new lighting
technologies penetrate a city
and correlate what is known about early adopters or lagging adopters
from municipal permitting databases to tease out the behavioral and financial components of
technology diffusion. At very high sampling rates, transients observable in the lights
e a measure of other plug loads
which would otherwise be accessible only with expensive submetering.
Moderate sampling rates can reveal behavioral information.
Visible, infrared, hyperspectral, and radar
imagery are all phenomenologies to be explored for urban scenes, as is Light Detection and Ranging
The synoptic and persistent coverage of such modalities
, together with their relatively easy and
may offer a useful complement to
ng from a
public vantage point raises issues of data collection, which
addresses in this volume.
How will the data be used?
Large urban datasets will be used in several different ways. One of the simplest is identification of
unusual data or
outliers. The distribution of observations of any given variable over a population may
reasonably be expected to be unimodal, although not necessarily normal. Large statistics, and control of
systematic trends, allows for clear identification of outliers
in such a distribution, which can then be
investigated in more detail.
An example is the energy use data from large buildings in NYC
normalized energy use intensity (annual kBtu/sq ft) of multifamily residential units is nom
while that for office buildings shows a “fat tail” on the high side, with the most inefficient buildings
consuming energy at more than
times the rate of the most efficient. Investigation of the causes of
such differences is clearly of interest (Data errors? Differences in occupancy? In activity? In
construction? In building operation?).
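
As an illustration of the outlier identification described above, the following is a minimal sketch, not the method used in the NYC benchmarking analysis. It flags high-side outliers with a median-based (robust) score, which does not assume the distribution is normal. The function name, the threshold of 3.5, and the energy use intensity values are all hypothetical and chosen only for illustration.

```python
from statistics import median


def flag_high_outliers(values, threshold=3.5):
    """Return indices of values far above the median.

    Uses the median absolute deviation (MAD) instead of the standard
    deviation, so a few extreme values do not mask themselves.
    The threshold of 3.5 is a common rule of thumb, assumed here.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread, so nothing can be flagged
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the underlying data happen to be normal.
    return [i for i, v in enumerate(values)
            if 0.6745 * (v - med) / mad > threshold]


# Hypothetical energy use intensities (annual kBtu/sq ft) for ten buildings.
eui = [62, 70, 75, 78, 80, 84, 88, 91, 95, 410]
outliers = flag_high_outliers(eui)
print(outliers)
```

Buildings flagged this way are candidates for the follow-up questions listed above (data errors, occupancy, activity, construction, operation), not conclusions in themselves.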
Large datasets will also be used to corroborate and evaluate simulations. As discussed below, an
important tool and product of urban informatics will be h
integrating mobility, land use, energy, health, economics, communications, etc.
datasets will be
essential to constructing, validating, and improving such simulations. It remains to be investigated what
tions these simulations need to reproduce with what fidelity for a given purpose.
In addition to data linkage, correlation analyses will be useful in constructing and validating
behavioral proxies. For example, demonstrating that infrared images are well
correlated with building
energy consumption in a small subset of buildings for which the latter are known directly (e.g., through
utility records) would allow accurate measurement of energy consumption for a much broader set of
buildings through synoptic IR imaging.
At a more technical level, a dataset that lists family income by
address could be combined with visual synoptic observation of lighting by address to infer energy use as
a function of family income.
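
The proxy idea in the infrared example can be sketched as a simple calibration: fit a line on the small subset of buildings where both the infrared signal and the metered consumption are known, check how strong the correlation is, and only then apply the fit to buildings with imagery alone. This is a minimal sketch under stated assumptions. The linear form, the function names, and every number below are hypothetical, and a real calibration would need to account for weather, building type, and measurement error.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


# Hypothetical calibration subset: an infrared brightness index and the
# metered energy use (arbitrary units) for five buildings with utility records.
ir_index = [1.0, 2.0, 3.0, 4.0, 5.0]
metered = [11.0, 21.0, 29.0, 41.0, 48.0]

slope, intercept = fit_line(ir_index, metered)
r = pearson_r(ir_index, metered)

# Apply the proxy to a building that has imagery but no utility record.
estimated = slope * 6.0 + intercept
print(round(slope, 2), round(intercept, 2), round(r, 3), round(estimated, 1))
```

The proxy is only as good as the correlation in the calibration subset, which is why the correlation is computed before any estimate is made.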
Urban data challenges
Despite the promise of urban datasets, turning the deluge of data into useful information and
ing faces a number of challenges.
As noted above, much of the value of large datasets lies in their correlation
ability to combine two or more
datasets to infer new properties
, but that opportunity is not costless
he urban data sources
we are interested in
in their character (text,
video, audio, mobility tracks,
. Their s
, and quality
a challenging task.
Classic database challenges such as poor naming standards
, or the absence of appropriate protections by the
for data integrity
are just a few of the hurdles that can dramatically limit the utility of existing datasets.
For data collected for
multiple purposes from different organizations, data provenance will be a significant consideration.
provenance information so that basic questions can be
such as: Who created this
data product and when? When was it
modified and by whom? What was the process used to create the
data product? Were two data products derived from the same raw
public data obtained
by CUSP, we will have to
accessed data, when and for what purpose.
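
The provenance questions listed above suggest the minimum a provenance record would have to hold. The following is a sketch of one possible record structure, not a description of any system in use at CUSP. The class, field, and identifier names are hypothetical and chosen only to mirror the questions in the text: who created a data product and when, how it was modified, what process produced it, whether two products share raw sources, and who accessed it for what purpose.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance record for one data product."""
    product_id: str
    created_by: str
    created_at: datetime
    process: str                      # how the product was derived
    raw_sources: Tuple[str, ...] = () # identifiers of the raw inputs
    modifications: List[tuple] = field(default_factory=list)
    accesses: List[tuple] = field(default_factory=list)

    def record_modification(self, who: str, when: datetime, note: str) -> None:
        """Answer 'when was it modified and by whom?'"""
        self.modifications.append((who, when, note))

    def record_access(self, who: str, when: datetime, purpose: str) -> None:
        """Answer 'who accessed the data, when, and for what purpose?'"""
        self.accesses.append((who, when, purpose))


def share_raw_source(a: ProvenanceRecord, b: ProvenanceRecord) -> bool:
    """Answer 'were two data products derived from the same raw data?'"""
    return bool(set(a.raw_sources) & set(b.raw_sources))


# Hypothetical usage with made-up identifiers.
energy = ProvenanceRecord("energy-by-lot", "analyst-a", datetime(2013, 9, 1),
                          "aggregate meter readings by lot",
                          raw_sources=("utility-feed",))
income = ProvenanceRecord("energy-vs-income", "analyst-b", datetime(2013, 9, 5),
                          "join energy with income by address",
                          raw_sources=("utility-feed", "income-table"))
energy.record_access("analyst-b", datetime(2013, 9, 4), "build income join")
print(share_raw_source(energy, income), len(energy.accesses))
```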
. Aside from the
from censuses, sample
surveys, administrative records, and statistic
in several important ways
as noted by Capps and Wright.
Much of the usual microdata encompass records numbering in the
hundreds of millions, while big data sets are many orders of magnitude greater. The computational
challenges associated with massive data management are substantially different for static data
terms of scale and throughput. Technical advances are required to scale data infrastructure for curation
machine learning, data min
as well as
modeling and simulation
to keep up
with the volume and speed of data.
Official statistics and data
sets tend towards periodic
cycles of input, analysis and
corporation’s quarterly earnings report or the Bureau of Labor Statistics’ Employment Situation
Summary on the first Friday of every month
while much of the data we would like to
would like to analyze that data
in real time for operational reasons
, including surveys,
to be labor
intensive, subject to human error and costly
in their collection
often born digi
seem relatively cheap
Surveys, which form the foundation of of
are the result of careful data collection
with clearly defined uses, while
come with unknowns (e.g., uses are less clear, data
are less understood, data are of unknown quality, and representativeness is largely unknown).
and Wright also note that with respect to surveys,
onse assumes permission to use.
, on the
come as byproducts of other primary activiti
and without asking explicitly.
To correlate data, it must be brought together.
may not be
difficult to analyze
or in the case of
and larger data sets
hard to move
, data can be
difficult to obtain
barriers and the
tructures of people within those
organizations can greatly complicate the
of obtaining data
In the commercial sector, proprietary
with respect to a firm’s real costs
In academia, the generation of or access to unique data is often the researcher’s edge in
competition with peers
a currency invested in cementing relationships with valued collaborators
as Stodden notes,
the interest of commercializing inventions derived from university
In the government sector,
frequently a source of power and
influence that helps
the limits of organizational turf. P
release of data to individuals
outside an agency is often
but not always
perceived to carry a risk for the agency
performance or evidence of disparities in regulatory enforcement)
to the agency itself
Such reticence only escalates within agencies when the data
held relates to a politically charged issue.
We should note that susceptibility to external social or political
pressure is not unique to the government sector nor is it uniform. In the commercial sector,
likely to be sensitive
than data aggregators
, search firms
or social media companies, whose
paying customers are other firms (often retail advertisers), not individuals.
The more routine explanation for barriers to information release by government agencies is that, in
market setting of government, demands on agencies always outpace available resources.
Where agencies cannot charge for their services, they must develop non-monetary costs to impose on
their clients as a means of rationing their outputs, including requests for information.
And so, their
willingness to share data, even data intended to be public, varies widely.
aking data easily available
can fall victim to
less nefarious causes
such as overtaxed staff, a failure of imagination
within the agency
that anyone would want all the available data,
or aesthetic considerations in design
example, federal budget data is supposed to be one of the most readily available types of
data. One can easily find all relevant Congressional documents for the appropriations process in an
on the Library of Congress’ Thomas website back to 1998,
yet the Department
of Energy makes only
the most recent
10 years of its Congressional Budget Justifications available on its
The Department has the documents in electronic format back to 1977, the year the
Department was established,
and could very easily post them
, as the US EPA does back to 1967
Relationship of Big Data to
[HAVE SOME HOMEWORK TO
DO ON THIS. IT WILL BE A SHORT, <1/2 PAGE, SECTION
Building upon a
nearly half century long
history of the Freedom of Information Act,
open government advocates need data scientists to help them sort through
and make sense of the vast troves of public data
With respect to the data itself, many NYC datasets are posted on the City’s open data website,
However, the roughly 1,000 datasets listed show great variability in
their data quality, currency, and completeness
In closing, a few brief remarks about the tensions
the analysis of massive datasets
The value of any large urban dataset is enhanced through its
association with other data. Observations are linked through
location and time, as well as through
entity (person, firm, vehicle, structure). The power of such linkage in producing new information is
significant. For example, knowing an individual’s ZIP code localizes that person to 1 in 30,000 (the
average population of a ZIP code).
Linking a ZIP code with a birthdate reduces the pool to
approximately one in 80, while further connecting gender and year
of birth is sufficient, on average, to
uniquely specify an individual.
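
The arithmetic behind these figures can be reproduced with a back-of-the-envelope calculation. The sketch below assumes, as an illustration only, that "birthdate" means day and month of birth, that birthdays are spread evenly over 365 days, that there are two gender categories, and that residents' birth years span about 80 years. Only the 30,000 figure comes from the text; the other constants are assumptions.

```python
ZIP_POPULATION = 30_000   # figure quoted in the text
DAYS_PER_YEAR = 365       # assumed: birthdays uniform over the calendar
GENDER_CATEGORIES = 2     # assumed for illustration
BIRTH_YEAR_SPAN = 80      # assumed range of birth years among residents

# People in one ZIP code who share a given day and month of birth.
pool_after_birthdate = ZIP_POPULATION / DAYS_PER_YEAR

# People who additionally share gender and year of birth.
pool_after_all = pool_after_birthdate / (GENDER_CATEGORIES * BIRTH_YEAR_SPAN)

print(round(pool_after_birthdate), round(pool_after_all, 2))
```

Under these assumptions the first pool is roughly 80 people, and the second is below one person on average, which is what it means for the combined attributes to single out an individual.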
There is a widespread assumption
tted by examples like the one above,
that information release
(sharing, flow) is synonymous with loss of privacy.
We should recognize, however, that all data is not equally
intrusive and all analyses are not likely to
be privacy violating
. In cases where data
or people to their
network, the risk to privacy is likely greater than one in which the data
is solely about individuals, absent any
If privacy is to be understood
as a value worth protecting, it cannot simply mean secrecy, i.e., the withholding of information.
’s work has argued that it is the inappropriate sharing (flow, release, etc.) of information
not simply the sharing that violates privacy.
Understood this way, there
is no inherent conflict
between data utility and data privacy. There is only conflict between
Periodically, suggestions arise that the solution to privacy concerns is increased
it all out there, we’ve got nothing to hide” ethos)
that can lead
careful analysis and improved techniques for assessing that balance need to be pursued.
be further developed for estimating
identification risks for particular settings,
so that data scientists
when discussing the risks to privacy could use a likelihood language akin to what the IPCC uses to
describe the probability of a given outcome (“virtually certain,” “likely,” “extrem
Research agencies and foundations supporting data science would do well to examine the precedent of
legal and social implications (ELSI) of genomics.
The tension between transparency in the
holding and analyzing data, data security, and the privacy interests of those whose data is
being used are appropriate topics for such a program.
In an urban setting, particularly when analyses are relevant to the
operations of agencies or the development
and assessment of government policy, benefits and costs are
going to be distributed unequally.
Most obvious is the value of data. The
individuals’ data is greater once included in
a large dataset
, a value
correlate with other data, than uncorrelated
Asymmetry in costs and benefits can manifest in subtle ways. With
respect to the “we’ve got
nothing to hide” ethos mentioned above, it could be that the data subjects do have something
they want to
keep confidential or even
. In any case,
having their data exposed as a mechanism for deflecting
criticism of the data scientist
is not an equitable trade
It is frequently believed that concentrated benefits or costs are far more
motivating in the political arena than are diffuse interests, even if the sum total of the diffuse benefits or
costs outweighs the concentrated sum
yet power still matters. Knowing how power is exercised in
government and having access to those wielding it matters greatly, and quantification brings an often
unexamined power and prestige to public policy debates.
Quantitative analysis can give the data
analyst greater standing or authority in a debate than the tacit knowledge of a blue-collar worker or a
community member. Caution in
the interpretive power of data models is crucial, given the real
potential for harm in some cases.
We in the data science community who are interested in accessing public data with the goal of
improving our scientific understanding of how cities operate and how they can operate for the greater
benefit of the citizens also need to demonstrate that
making public data open can benefit the
and the civil servants
Our goal is not just relevant research but impact,
need to approach this goal with a degree of humility.
Data is not equivalent to information, and
information, when injected into management, policy and political spheres, is far from determinant
. As Downs notes, top
level officials in government or any large organization tend to
become involved only in the most difficult situations.
for agency decision makers
information but in assessing its significance in terms of future events
around which there will always be some uncertainty
are making those
competing philosophies of change. When
faced with politicians and
citizens whose outlook is trusting towards government, agencies are
incorporate more information into decision making as an expression of
management or towards greater experimentation in approaches for meeting their mission
with politicians and citizens whose outlook is distrustful of government, agencies face aggressive
oversight and pressures to root out waste.
And so, urban science and data scientists interested in big
data need to continually be aware of the
context from which data comes,
the context in which analyses
are used to make decisions, and the context within which privacy concerns are balanced.
Insert Cisco reference.
Constance L. Hays, “What Wal-Mart Knows About Customers' Habits,” New York Times, November 14, 2004.
King, G. (2011) “Ensuring the data-rich future of the social sciences,”
, 331(6018), 719
Smart cities of the future
Eur. Phys. J. Special Topics
Luís M. A. Bettencourt, José Lobo, Dirk Helbing, Christian Kühnert, and Geoffrey B. Wes
scaling, and the pace of life in cities
2007 104 (17) 7301
Bettencourt, L., Lobo, J., & Strumsky, D. (2007).
“Invention in the City: Increasing Returns to Patenting as a
Scaling Function of Metropolitan Size,”
Research Policy, 36
Insert ref for Datakind, hacks,
ode for America
Marta C. Gonzalez, Cesar A. Hidalgo, Albert-Laszlo Barabasi, “
Understanding individual human mobility
, v453, n5, pp. 779-782, June 2008;
Nicolas Maisonneuve, Matthias Stevens
and Bartek Ochab,
“Participatory noise pollution monitoring using
15 (2010) 51
71 51; DOI 10.3233/IP
Wang, P., Hunter, T., Bayen, A.M., S
chechtner, K. &
Gonzalez, M.C., “
d Usage Patterns in
Nature, Sci. Rep.
, 1001; DOI:10.1038/srep01001(2012).
T. Giannetsos, T. Dimitriou and N. R. Prasad
centric Sensing in Assistive Healthcare: Privacy Challenges
Security Comm. Networks
Anthes, G., Deep Learning Comes of Age,
Communications of the ACM
. Jun2013, Vol. 56 Issue 6, p13
Microsoft Photosynth. Available at
. (accessed September 18, 2013). For a
technical description of the method, see
Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon
Curless, Steven M. Seitz, and Richard S
Building Rome in a Day
Communications of the ACM
Vol. 54 Issue 10, p105
; DOI: 10.1145/2001269.2001293
New York City Mayor’s Office of Long-Term Planning and Sustainability,
New York City Local Law 84
Benchmarking Report, 2013
NYC government uses at least
three distinct identifiers for buildings, depending upon the agency and use: a
lot number assigned by Department of Finance, a unique Building Identification Number (BIN)
assigned by City Planning, and the street address of residence or commercial property used by most agencies.
Susan B Davidson, Juliana Freire,
Provenance and scientific workflows: challenges and opportunities
of the 2008 ACM SIGMOD international conference on Management of data, pp. 1345
C., Wright, T, “Toward a Vision: Official Statistics and Big Data,”
, 1 August 2013. . Available
(accessed September 19, 2013).
National Research Council. 2013.
Frontiers in Massive Data Analysis.
Washington, D.C.: The National Academies
V. Stodden and I. Reich, “
Software Patents as a Barrier to Scientific Transparency: An Unexpected Conseq
Dole,” Conference on Empirical Legal Studies, 2012. Available at
(accessed September 29, 2013).
Boston, MA: Little, Brown & Co., p. 188 (1967).
The Library of Congress Thomas,
Status of Appropriations Legislation for Fiscal Year 2014
. Available at
(accessed September 28, 2013).
Energy.Gov, Office of the Chief Financial Officer, U.S. Department of Energy,
Budget (Justification & Supporting
. Available at
September 28, 2013).
U.S. Environmental Protection Agency,
Historical Planning, Budget, and Results Reports
. Available at
(accessed September 29, 2013).
Since EPA was established in 1970, this includes 3 years of budget data from predecessor agencies.
Sweeney, Latanya. "Foundations of privacy protection from a computer science perspective." (2011)
(accessed September 28, 2013).
Privacy in Context
Technology, Policy, a
Dankar, Fida Kamal; El Emam, Khaled; Neisa, Angelica; Roffey, Tyson.
Estimating the re-identification risk of
BMC Medical Informatics & Decision Making.
2012, Vol. 12 Issue 1, p66
IPCC, 2007: Climate Change 2007: Synthesis Report. Contribution of Working Groups I, II and III to the Fourth
Assessment Report of the Intergovernmental Panel on Climate Change [Core Writing Team, Pachauri, R.K. and
Reisinger, A. (eds.)]. IPCC, Geneva, Switzerland; Appendix II, p. 83.
National Human Genome Research Institute,
ELSI Planning and Evaluation History
. Available at
mber 29, 2013).
Theodore M. Porter
Trust in Numbers: The Pursuit of Objectivity in Science and Public Life
, Princeton University
The Fires: How A Computer Formula, Big Ideas, and The Best of Intentions Burned Down New York
and Determined the Future of Cities
. New York: Riverhead Books, 2010. In the late 1960s, New York City
Mayor John Lindsay hired consultants from the RAND Corporation to help modernize municipal service delivery
and achieve budget savings
. RAND recommended an overhaul of fire station locations and the number of
engines responding to fires, based on flawed firefighter response time data.
When fire broke out in the Bronx,
firefighters were unable to respond in time, and fires ended up burning out of control.
., p 190.
Paul C. Light,
A Government Ill Executed
, Cambridge, MA: Harvard University Press, pp.164