Towards Internet-Scale Multi-View Stereo


Yasutaka Furukawa, Brian Curless, Steven M. Seitz, Richard Szeliski

[Computer Vision and Pattern Recognition, 2010]


HCI Final Report


Q56994049

蔡亦恒

Institute of Medical Informatics

First-year Master's student



Abstract


This paper puts forward a method to facilitate the reconstruction of 3-D imagery. Using existing multi-view stereo methods, large-scale reconstruction of real-world environments from unstructured photo collections is demonstrated in this work. Issues arising from the use of unstructured image data, which often contains great disparity between images, are also discussed. An algorithm is designed with a multitude of procedures to eliminate results of insufficient quality while also enforcing desired constraints. Extensive tests have been conducted on the proposed algorithm using datasets consisting of images from Flickr.com, a well-known large-scale image hosting service.


1. Introduction

Since ancient times, humans have strived to capture and retain images of the real world. From cave paintings, to impressions on paper or canvas, and now photography on film or digital media, the process of reproducing what we perceive with the eye is constantly evolving. With the invention of computers and the digitization of photography, it is now possible to depict the real world in ways our ancestors were not capable of. The computational power of these machines enables us to manipulate images with far more finesse than in the past, when results depended on the skills of the artist or photographer and relatively primitive tools. The digitization of modern photography in recent years has greatly accelerated the speed at which captured visual data can be distributed. Combined with the development of internet technologies and web services, such data is now easily available to every corner of the world that has some form of internet access.

1.1 Motivation

Image processing techniques have been developed ever since imaging technologies were first conceived. Digital image processing, the application of computer science methods to manipulate image data, was first heavily developed in the 1960s. However, the costs of processing data were very high, and the hardware of that era was unable to cope with the workload, save for special dedicated systems. Today, computing power and hardware have become cheaper and much more readily available, allowing previously developed methods to be realized and applied. Coupled with the availability of large amounts of image data from the World Wide Web, the feasibility of 3-D reconstruction of objects from around the world is a great opportunity to explore.

1.2 Related Works

Stereoscopy has been explored by scientists since as early as the 1830s. With modern computing power, the processing of stereoscopic images is no longer limited to single views of stationary images. Motion pictures using stereoscopic techniques are common today, giving viewers an immersive experience as if they were participating in the scene itself. Traditionally, viewing stereoscopic images required the viewer to put on a special set of glasses to simulate a perception of depth, and the earliest stereoscopic images required the viewer to cross their eyes to trick the mind's perception of depth. In recent years, developments in autostereoscopy have allowed viewers to enjoy 3-D images without having to wear cumbersome filter glasses. It can clearly be seen that stereoscopy is a very fast developing field of research.

For the reconstruction of 3-D entities or environments, many of the necessary components have already been heavily researched. Matching algorithms, which provide accurate correspondences between data, are well developed, for example the scale-invariant feature transform (SIFT). Structure-from-motion (SFM) algorithms for estimating precise camera poses, as well as high-accuracy multi-view stereo (MVS) methods for reconstructing 3-D models from 2-D imagery, are readily available. Utilizing the above-mentioned methods, some research groups have already experimented with 3-D reconstruction and produced impressive results.

The greatest challenge of this research is scalability: how to approach the task of reconstruction with extreme amounts of data, such as the millions of images hosted on the World Wide Web. One notable piece of work on reconstruction at such a scale is Agarwal et al.'s research, the "Building Rome in a Day" project. The main focus of this research, however, is on Internet-scale MVS.

An MVS algorithm uses several images to make correlation measurements for deriving 3-D surface information. In general, these algorithms aim to reconstruct a 3-D model using all of the input images. Taking into consideration the complexity of images found online, this is no longer a good approach, even if it remains feasible. To tackle this issue, this research proposes a novel view selection and clustering scheme to support working with massive photo sets. The idea when dealing with such massive photo sets is to retain only a sufficient number of images, such that loss of detail is minimized while processing is sped up. Since the image set is partitioned, a problem arises in reconstruction where features are lost at the borders of the clusters; consequently, the approach taken here is to create overlapping clusters to help retain details at cluster edges. Further consideration is given to speeding up the reconstruction process by designing the proposed algorithm in a fashion that supports parallel processing of multiple clusters.


2. Method

2.1 View Clustering

An assumption is first made that the input images have already been preprocessed with a structure-from-motion (SFM) algorithm to retrieve camera poses and a sparse set of 3-D points. The resultant points are taken to be sparse samples of the dense MVS reconstruction. The idea of view clustering is to segregate the input image set into an unknown number of clusters, such that their sizes are manageable and each SFM point may be reconstructed by at least one such cluster Ck (Figure 1).

Figure 1. Image Clustering


2.1.1 Problem Formulation

Given the large number of photos in a photo collection, the data must first be organized so that it is fit for further processing. A clustering method is applied to satisfy three important constraints: 1) redundant images are excluded to compact the data, 2) each cluster size is suitable for MVS reconstruction, and 3) loss of content and detail is minimal in contrast to reconstructing with the whole dataset. These constraints help to eliminate problems arising from Internet photo collections, namely high redundancy, which leads to noisy reconstruction due to insufficient baselines. Computing efficiency is also enhanced under these constraints, as redundant data is removed from the clustered photo collection.

The objective is to minimize the total number of images Σk |Ck| in the output clusters, subject to two conditions:

i. Each cluster fits an upper bound, |Ck| ≤ α, such that it may be processed by an MVS algorithm; the bound α is determined by the available computational resources.

ii. The clusters provide enough coverage, such that each SFM point is covered by at least one cluster Ck.

An SFM point is said to be covered if it is “well reconstructed”, which is determined by a function f(P, C) measuring the expected reconstruction accuracy of a 3-D location P by a set of images C. A location Pj is deemed covered if, for at least one of its covering clusters Ck, f(Pj, Ck) is at least λ times f(Pj, Vj), the expected accuracy using all of Pj's visible images, denoted Vj. The factor λ has been determined empirically to be 0.7.
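To make this test concrete, the following is a minimal sketch of the coverage check, assuming a placeholder accuracy function f(point, images) is available; the function and the data structures here are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of the coverage test; f stands in for the paper's
# reconstruction-accuracy function, LAMBDA = 0.7 as stated in the text.
LAMBDA = 0.7

def is_covered(point, clusters, visible_images, f):
    """An SFM point Pj is covered if some cluster Ck reconstructs it at
    least LAMBDA times as accurately as all of its visible images Vj."""
    baseline = f(point, visible_images)              # f(Pj, Vj)
    return any(f(point, ck) >= LAMBDA * baseline     # f(Pj, Ck) >= lambda * f(Pj, Vj)
               for ck in clusters)
```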


Summary of Clustering Requirements

    minimize Σk |Ck|  subject to:
    (size)      |Ck| ≤ α for every cluster Ck
    (coverage)  for every SFM point Pj there exists a cluster Ck with f(Pj, Ck) ≥ λ · f(Pj, Vj), with λ = 0.7
Key contributions of this formulation are as follows:

I. Discarding redundant images via the minimization ensures small data sizes.

II. Clusters overlap automatically, which reduces white space (holes) in the reconstruction.

III. Image quality is inherently taken into account, as poor quality images contain fewer SFM points, causing them to be eliminated in the selection process.


2.1.2 View Clustering Algorithm

The constraints presented above are not in a form that can be processed directly by existing methods such as k-means or normalized cuts, which makes this a challenging problem. Before introducing the view clustering procedure (Figure 2), some definitions must first be made.

Given two images Il and Im, the image pair is said to be neighbors if there exists an SFM point visible in both images. This definition also applies to two image sets: they are neighbors if there exists a pair of images, one chosen from each set, that are neighbors. A pair of SFM points Pj, Pk are neighbors if the following conditions are satisfied:

i. Their respective visible image sets, Vj and Vk, are neighbors according to the above definition.

ii. The projected locations of the two SFM points are within δ1 = 64 pixels of each other for every image in (Vj ∩ Vk).
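A hedged sketch of these two neighbor tests is given below; the visibility maps, the projection helper, and the point layout are assumptions for illustration, while the 64-pixel threshold follows condition ii above.

```python
import math

DELTA1 = 64  # pixel threshold from condition ii

def images_are_neighbors(img_a, img_b, points_seen_by):
    """Two images are neighbors if at least one SFM point is visible in both."""
    return bool(points_seen_by[img_a] & points_seen_by[img_b])

def points_are_neighbors(pj, pk, Vj, Vk, points_seen_by, project):
    """Two SFM points are neighbors if (i) their visible image sets are
    neighbors and (ii) their projections stay within DELTA1 pixels in
    every image that sees both points."""
    sets_touch = any(images_are_neighbors(a, b, points_seen_by)
                     for a in Vj for b in Vk)
    if not sets_touch:
        return False
    return all(math.dist(project(img, pj), project(img, pk)) <= DELTA1
               for img in Vj & Vk)
```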

The proposed algorithm is composed of four steps, as shown in Figure 2. In the first two steps, the input image dataset is preprocessed to retain the informative images; in the last two steps the actual clustering is performed according to the previously stated constraints.


Figure 2. View Clustering Process


1. SFM filter: merge SFM points

Each SFM point in the input set is merged with its neighbors and output, until the input set is empty; each new merged point's position is the average of its neighbors, and the input points that were merged are dropped. By merging SFM points in this fashion, the total number of points to be processed can be greatly reduced, boosting computational efficiency. The merging process also serves to minimize any feature losses that could affect visibility estimation. (See Figure 3.)
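As an illustration of this merging step, here is a minimal greedy sketch; the index-based data layout and the neighbors_of helper are assumptions for illustration, not the paper's implementation.

```python
def merge_sfm_points(positions, visibilities, neighbors_of):
    """positions[i]: 3-D coordinates of SFM point i; visibilities[i]: set
    of images seeing point i; neighbors_of(i): indices of its neighbors."""
    remaining = set(range(len(positions)))
    merged = []
    while remaining:
        i = remaining.pop()
        group = [i] + [j for j in neighbors_of(i) if j in remaining]
        remaining.difference_update(group)
        # merged position is the average of the group's positions
        avg = tuple(sum(positions[j][d] for j in group) / len(group)
                    for d in range(3))
        # the merged point keeps the union of the group's visible images
        merged.append((avg, set().union(*(visibilities[j] for j in group))))
    return merged
```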

2. Image selection: remove redundant images

Each image in the input set is tested, and if the coverage constraint still holds when the image is removed, the tested image is removed. Removal of the image is permanent if the constraint holds without it, which helps to further speed up later processing. The testing process starts from the images of lowest resolution, which allows the less informative images to be removed first.
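A minimal sketch of this greedy removal pass follows, assuming a coverage_holds callback that re-checks the coverage constraint of Section 2.1.1; both the callback and the resolution lookup are illustrative assumptions.

```python
def remove_redundant_images(images, resolution, coverage_holds):
    """Test images from lowest to highest resolution; an image is dropped
    permanently if coverage still holds for the remaining set."""
    kept = set(images)
    for img in sorted(images, key=resolution):   # least informative first
        if coverage_holds(kept - {img}):
            kept.discard(img)
    return kept
```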

3. Cluster division: enforce size constraint

If a cluster violates the size constraint, it is split into smaller sets. The split is performed using normalized cuts on a visibility graph, with the images forming the nodes. The edge weight between two images (Il, Im) indicates how much the pair jointly contributes towards MVS reconstruction of the relevant SFM points. Mathematically, the edge weight may be expressed as:

    e(Il, Im) = Σ_{Pj ∈ Θlm} f(Pj, {Il, Im}) / f(Pj, Vj)

where Θlm denotes the set of SFM points visible in both Il and Im. Image pairs with a high contribution towards MVS reconstruction have higher edge weights and are therefore less likely to be separated by the cut. Cluster division is complete when all clusters satisfy the size constraint.
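A small sketch of this edge-weight computation is shown below; shared_points and the accuracy function f are assumed helpers, mirroring the formula above.

```python
def edge_weight(Il, Im, shared_points, visible_images, f):
    """e(Il, Im) = sum over SFM points Pj visible in both images of
    f(Pj, {Il, Im}) / f(Pj, Vj)."""
    return sum(f(pj, {Il, Im}) / f(pj, visible_images[pj])
               for pj in shared_points(Il, Im))
```

The resulting weighted graph can then be handed to a normalized-cut solver; Section 3 notes that the Graclus package is used for this purpose in the experiments.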

4. Image addition: enforce coverage

When executing cluster division, it is quite possible that the coverage constraint becomes violated. This fourth step adds images to clusters in order to restore proper coverage. A list of actions is drawn up based on the effectiveness of adding an image to a cluster for increasing coverage.

For an uncovered SFM point Pj, let Ck = argmax_Cl f(Pj, Cl) be the cluster with maximal reconstruction accuracy. A corresponding action {(I → Ck), g} adds an image I (I ∈ Vj, I ∉ Ck) to Ck, where g = f(Pj, Ck ∪ {I}) − f(Pj, Ck) measures the effectiveness of the action. Images are only added to the maximal cluster, so that actions involving the same image and cluster can be merged. The generated actions are sorted in descending order of effectiveness, and the most effective actions are chosen first. To limit computational cost, only action candidates scoring at least 0.7 times the highest score are considered.

Once an action is committed, the effectiveness of similar actions may be affected; therefore, conflicting actions {(I → C), g} and {(I' → C'), g'} where I and I' are neighbors are removed from the list.

In turn, it is possible that the size constraint is violated when enforcing the coverage constraint, so steps three and four are repeated until both constraints are satisfied.
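Putting steps three and four together, the following sketch shows this alternation until both constraints hold; split (the normalized-cut division) and best_action (the argmax action selection above) are placeholders standing in for the procedures just described.

```python
def enforce_constraints(clusters, alpha, uncovered_points, split, best_action):
    """Alternate cluster division (step 3) and image addition (step 4)
    until the size bound alpha and the coverage constraint both hold."""
    while True:
        # step 3: split any cluster that violates the size bound
        while any(len(c) > alpha for c in clusters):
            big = next(c for c in clusters if len(c) > alpha)
            clusters.remove(big)
            clusters.extend(split(big))              # normalized cuts
        # step 4: add images so every SFM point is covered again
        missing = uncovered_points(clusters)
        for pj in missing:
            img, ck = best_action(pj, clusters)      # most effective {(I -> Ck), g}
            ck.add(img)
        # stop only when neither constraint is violated any more
        if not missing and all(len(c) <= alpha for c in clusters):
            return clusters
```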


2.2 MVS Filtering and Rendering

Now that suitable image clusters are at hand, the actual 3-D reconstruction can begin. PMVS by Furukawa et al. is selected for the reconstruction of 3-D points, as it is publicly available. Two filters, a quality filter and a visibility filter, are proposed to deal with errors and quality issues during reconstruction. (See Figure 6.)

Figure 3. SFM Filter (top), View Point Clustering (bottom)

Figure 4. Quality Filter

2.2.1 Quality Filter

A surface region may be reconstructed by multiple image clusters, resulting in differences in output quality; the goal is to retain good quality reconstructions and eliminate noisy ones. For a point Pj reconstructed from cluster Ck, information on nearby MVS points {Qm} and their visibility information {Vm} is collected. Points are collected on the basis of two rules: 1) their normals are compatible with Pj, and 2) their projected locations are within δ2 = 8 pixels (chosen empirically) of Pj in every image of Vj. A histogram is constructed from the collected information, and if the histogram value for Pj is less than 0.5 times the maximum value of the constructed histogram, Pj is filtered out. (See Figure 4.)
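A hedged sketch of this filter is given below. The candidate list, the normal-compatibility test, and the projection-distance test are assumed helpers, and accumulating the per-cluster histogram with the accuracy scores f(Qm, Vm) is one plausible reading of the text, not a statement of the paper's exact histogram.

```python
DELTA2 = 8  # pixel threshold from the text

def passes_quality_filter(pj, pj_cluster, candidates, f, compatible, close_projection):
    """Keep Pj only if its own cluster's histogram score reaches at least
    half of the best cluster's score among nearby, compatible MVS points.
    candidates yields (Qm, cluster of Qm, Vm) triples."""
    hist = {}
    for qm, cluster, vm in candidates:
        if compatible(pj, qm) and close_projection(pj, qm, DELTA2):
            hist[cluster] = hist.get(cluster, 0.0) + f(qm, vm)
    if not hist:
        return True                  # nothing to compare against; keep the point
    return hist.get(pj_cluster, 0.0) >= 0.5 * max(hist.values())
```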

2.2.2 Visibility Filter

This filter serves to enforce visibility consistency over the entire reconstruction. It is similar to the filter in PMVS, differing in that this filter emphasizes inter-cluster visibility over the entire reconstruction. The set of MVS points reconstructed from a reference cluster Ck is used to construct depth maps for the images in that reference cluster by projecting the MVS points. A point P is said to conflict with a depth map if it is closer to the camera by a small margin and its reconstruction accuracy is less than half the accuracy value stored in the depth map. If the overall conflict count of the point P is greater than three, it is filtered out. (See Figure 5.)
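A minimal sketch of this conflict count follows, assuming simple depth-map objects with a lookup of (depth, accuracy) at a point's projection; the margin value is not given in the text and is left as a parameter.

```python
MAX_CONFLICTS = 3

def passes_visibility_filter(p, p_accuracy, depth_maps, point_depth, margin):
    """Count the depth maps that P conflicts with: P sits in front of the
    stored depth by more than `margin` while being reconstructed at less
    than half the stored accuracy. More than three conflicts -> filter out."""
    conflicts = 0
    for dm in depth_maps:                        # depth maps of the reference cluster
        stored_depth, stored_accuracy = dm.lookup(p)
        if (point_depth(dm, p) + margin < stored_depth
                and p_accuracy < 0.5 * stored_accuracy):
            conflicts += 1
    return conflicts <= MAX_CONFLICTS
```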



Figure 5. Visibility Filter


2.2.3 Scalability

MVS reconstruction and filtering is the most computationally expensive and intensive part of the system. The most critical factor affecting the scalability of the system is memory complexity. The choice of MVS algorithm has an impact on system scalability as well, but since the clusters are all constrained to an upper bound, this is not an issue for this system.

Worst-case analysis is very difficult for this system, so only the average case is analyzed. Let NP, NI, NM and NC denote the average number of pixels per image, images per cluster, MVS points per cluster, and the number of extracted image clusters, respectively. The amount of memory required by the quality and visibility filters at any point in time is subject to two observations: 1) both filters only ever need to store the MVS points from two clusters at a time, and 2) the visibility filter only stores the depth maps for one cluster at a time. Therefore, the average-case memory requirement is 2·NM for the quality filter and 2·NM + NP·NI for the visibility filter.
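As a quick back-of-the-envelope check of these expressions, the sketch below plugs in purely illustrative values; they are not measurements from the paper, and the units are point records and pixels rather than bytes.

```python
# Average-case memory terms from Section 2.2.3, with made-up example values.
N_M = 1_000_000    # MVS points per cluster (illustrative)
N_P = 2_000_000    # pixels per image (illustrative)
N_I = 150          # images per cluster (illustrative)

quality_filter_memory    = 2 * N_M               # points of two clusters at a time
visibility_filter_memory = 2 * N_M + N_P * N_I   # plus one cluster's depth maps

print(quality_filter_memory, visibility_filter_memory)
```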

2.2.4 Rendering

Once the MVS points have been reconstructed, each point is assigned a color based on the average pixel color of its projections into the images that see it. The colored points are further enhanced by a 3x3 upsampling of the MVS points to improve rendering quality. QSplat is then used to visualize the 3-D points.
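The per-point coloring step can be sketched as follows; the projection and pixel lookup helpers are assumptions standing in for the renderer's actual image access.

```python
def color_points(points, visible_images, project, pixel_color):
    """Assign each MVS point the average RGB color of its projections in
    the images that see it."""
    colors = {}
    for p in points:
        samples = [pixel_color(img, project(img, p)) for img in visible_images[p]]
        if not samples:
            continue  # point not visible anywhere; leave uncolored
        colors[p] = tuple(sum(channel) / len(samples) for channel in zip(*samples))
    return colors
```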





Figure 6. MVS Filtering


3. Experimental Results


Experiments with the proposed algorithm were conducted on a PC with dual Xeon 2.27 GHz processors. The proposed algorithm was implemented in C++. The datasets for the experiments, as well as the SFM outputs, were provided courtesy of Agarwal et al. For normalized cuts, the Graclus method by Dhillon is used, with PMVS being the software selected to reconstruct the MVS points. Finally, the visualization of the processed results is output with QSplat. Several parameters are associated with the setup of the view clustering algorithm and are held constant for all of the processed datasets. The values of these parameters are shown in Table 1 below.


Table 1. View Clustering Parameters

150        4        0.7        0.7



Of all the processed datasets, Dubrovnik spans the largest real-world region, encompassing the entire old city. The largest dataset, containing 14,000 images, is of Piazza San Marco. Statistics for each processed dataset, along with processing details, are shown in Table 2 below. As can be seen in the table, many of the images are eliminated during the selection process, especially in cases of high overlap and simple geometry. Running times for each dataset are given with the first number being the serial execution time and the second number, in parentheses, the parallel execution time.


Table 2. Dataset and Processing Statistics



Captures of views from the reconstructed models are shown in Figures 7 and 8. Looking at these captured images, we can see that when the MVS points are clustered in a mutually exclusive fashion, there is a high occurrence of white space (holes) in the reconstruction, whereas with overlapping clusters the amount of white space is significantly reduced. For comparison with previous research conducted by the same authors, Goesele et al.'s depth-map based algorithm was applied to the St. Peter's Basilica dataset, both without and with image selection. Their depth map estimation was scalable and parallelizable, but their Poisson-based depth map merging was not. Depth-map based MVS reconstruction is very noisy, requiring additional software to eliminate the unwanted information. Execution times on the St. Peter's Basilica dataset were 640 and 342 minutes without and with image selection, respectively. This serves to reinforce that the image selection step is both beneficial and applicable to other MVS algorithms as well.

Finally, experiments were conducted on how different cluster upper bound sizes (α) affect the running time of both serial and parallel executions. It can be seen in Figure 10 that with decreasing cluster size, parallel execution performs better than serial execution; however, below a very low α threshold, parallel performance becomes slower, likely due to the great number of redundant MVS points resulting from excessive overlap between clusters.



Figure 7. Reconstruction of Piazza San Marco (Venice)


Figure 8. Input image sample (top) with rendering of reconstructed models (bottom)




Figure 9. Reconstruction with Goesele et al.'s depth-map based MVS

Figure 10. Effect of cluster size upper bounds on execution time


4. Conclusion

A novel MVS method for dealing with very large scale unorganized photo collections is developed in this paper. With this method, it is demonstrated that reconstruction of 3-D models from large unorganized photo collections is possible. The method splits photo collections into overlapping clusters such that each cluster may be reconstructed in parallel. Well-scaling algorithms for MVS filtering are also designed, which handle the inconsistent quality commonly found in unorganized photo collections. Actual reconstruction of 3-D models from online photo collections is shown by the rendering of several well-known locations in Europe from image collections hosted on Flickr.com.


5. Discussion & Future Work

The proposal of an algorithm capable of reconstruction at such a large scale is impressive. The plausibility of reconstruction from available imagery could potentially allow people to visit sites from around the world without ever leaving their homes. It could also be applied in medical environments to allow for immersive diagnosis of patients, or in crime scene reconstruction to allow accurate modeling of the committed crime. From the image captures shown, it can be seen that the reconstruction is not only large in scale but also of very high quality. Moreover, the runtimes of the parallel reconstructions conducted in the experiments are quite short, considering the large regions covered and the large number of images processed.

Assuming computational power continues to improve in future years, could an immersive virtual environment system be realized based on the reconstruction of available imagery? In addition, given that photography technologies are always improving, giving us ever better captures of entities, is it feasible to create a system capable of progressively updating a reconstructed model without starting from scratch whenever a better image of the model becomes available?


Appendix


Building Rome in a Day

Agarwal et al. attempted to reconstruct Rome based on images retrievable from Flickr.com. Experiments were conducted under the constraint of testing how much could be reconstructed within 24 hours. It is demonstrated that it is possible to reconstruct a city from 150K images using 500 compute cores, though the experiments also show that dealing with such a massive amount of input data remains very computationally intensive. Their paper also suggests that, with developments in social networks and geolocation information sharing, this information could be used to further filter images, eliminate redundant imagery, and better coordinate reconstruction.


PMVS

PMVS is short for patch-based multi-view stereo. It is a software package developed to support 3-D reconstruction from images. Only rigid structures are reconstructed by this software, so other elements such as pedestrians and animals are ignored. The software is developed by Yasutaka Furukawa of the University of Washington and Jean Ponce of Ecole Normale Supérieure. PMVS is currently in its second iteration, with an improved, faster algorithm and greater accuracy. It can be used in tandem with Bundler by Noah Snavely for automated estimation of camera parameters from images. Support for 64-bit systems allows for greater memory capacity. Currently, the developers are also seeking programmers familiar with GPUs in order to use GPUs to speed up the package.

Further details on PMVS: http://grail.cs.washington.edu/software/pmvs/