Text Mining to identify topical emergence: Case Study on Management of Technology


Alan L. Porter*, David Newman** and Nils C. Newman***

* alan.porter@isye.gatech.edu; Technology Policy and Assessment Center, Georgia Tech, Atlanta GA 30332-0345 (USA) and Search Technology, Inc., Norcross GA 30992 (USA)

** newman@uci.edu; Department of Computer Science, University of California, Irvine CA 92697 (USA)

*** newman@iisco.com; IISC, Atlanta GA 30357 (USA) and UNU-MERIT, University of Maastricht, 6211 TC Maastricht (The Netherlands)



Abstract

We are investigating methods to extract technical intelligence from Science, Technology & Innovation (ST&I) research publications, patents, and contextual records. The approach starts with an explicit statement of the analytical objectives. We then capture a document set on an ST&I topic. Next we consolidate term variations and extract relationships to address the questions at hand. This research recognizes many challenges, including how best to “clump” terms, effective statistical techniques to elucidate key relationships, and how to engage human expertise effectively.

This paper advances these methodological aims by exploring Management of Technology (MOT). We obtained the abstracts from the Portland International Conference on Management of Engineering and Technology (PICMET) from 1997 through 2012. We analyzed these using three approaches and compared their efficacy: 1) Topic Modeling, without human intervention; 2) Term Clumping and Principal Components Analysis (PCA) with minimal human intervention; and 3) Term Clumping and PCA with extensive human intervention.

We thus identify prominent topical themes in the field and compare these as derived by the alternative methodologies. Such topics as intellectual property and supply chains top the topical list in recent years, with interesting differences in regional emphases. Such findings can contribute to decisions about research priorities and opportunities.




1. This research has been supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20155. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

2. We thank our colleagues for help with the methods applied here, especially Cherie Courseault Trumbach of UNO and Webb Myers and Douglas Porter of IISC. We thank those who rated the importance of the term groups: Dundar Kocaoglu, Tim Anderson, Cherie Trumbach, and Scott Cunningham (along with two of us, Newman and Porter).



Introduction

Beginning about 1993, we have been striving to extract usable Competitive Technical Intelligence (CTI) from ST&I text resources. We have largely concentrated on search results of field-structured data from abstract record databases [e.g., a search on “Dye Sensitized Solar Cells” (DSSCs) in Web of Science; plus a separate search for DSSCs in Derwent World Patent Index]. Over the years we have advanced this “Tech Mining” approach (Porter and Cunningham, 2005). Recently, we have aimed at Forecasting Innovation Pathways (FIP) based on such text mining, complemented with gathering of expert opinion (Porter et al., 2010; Robinson et al., to appear).

In this paper we focus on the text mining stage of such analyses. The aim is to develop a systematic approach that can be semi-automated. The advantages of automated text mining include enhanced efficiency, analytical speed, and reproducibility. These contribute to making CTI and FIP information more available for timely decision support. In so doing, we believe that both ST&I policy-making and Management of Technology can be markedly enhanced. To a striking degree, these fields rely upon tacit instead of empirical knowledge. By providing fast and effective technology analyses, we aspire to change that.

Of special interest here is evaluating the degree to which automated text analyses can deliver informative findings. To that end, we explicitly set out to compare three contrasting approaches:

1. Topic Modeling: a methodology entailing minimal human intervention
2. Term Clumping with Principal Components Analysis (PCA), with minimal human intervention
3. Term Clumping with PCA, with heavy human intervention

The key research question investigated here is: how essential is human intervention in generating informative technology profiles? In other analyses of more technical topics (e.g., DSSCs), application of either Topic Modeling or Term Clumping/PCA, without significant human tuning, tended to yield hard-to-interpret results.

The topic of study herein is “Management of Technology” (MOT). The case study entails analysis of the abstract records of the Portland International Conference on Management of Engineering and Technology (PICMET) from 1997 through 2012. This case provides a convenient test-bed in that the analysts are quite knowledgeable about the field. Hence, we do not need to grapple with two distinct forms of expertise: 1) concerning Tech Mining methods and 2) concerning the topic under study (the more typical situation in which analysts address an emerging technology with its somewhat arcane knowledge requirements).

In addition, we believe that analyzing changes and relationships in MOT has inherent interest for the STI2012 audience. Findings should be helpful in identifying MOT research opportunities and concentrations of expertise in particular MOT sub-areas.






Data and Methods

Data

Portland State University colleagues who organize the PICMET conferences kindly have provided the abstract records of the conference papers through 2012 (as of December, 2011). Previously, we had analyzed PICMET content beginning with 1997 (Porter and Watts, 2005). In the early period, PICMET met in odd-numbered years, e.g., 1997, 1999, 2001, and 2003. Country and organizational affiliation information is available from 2001. Since 2004, PICMET has convened outside of Portland and the USA in even years. So the series becomes annual from 2003 onward. Location affects participation; e.g., Turkey hosted PICMET in 2006 and Turkish participation is exceptionally high in that year. Annual paper counts differed for 2003 and 2005 between our earlier analyses and the recent compilation from PICMET, so we use the recent sets in these analyses.

The tally of abstract records for all these years is 5169. From 2001 on, there are 4239 records. The records were imported into VantagePoint software for the bulk of these analyses.

Methods

Newman et al. (to appear, 2012) provide a well-targeted overview of text manipulation methods pertinent to extracting intelligence from ST&I sources. There are several important aspects to such Tech Mining:

- Entity extraction: e.g., identifying term units (words or phrases) from the text for further analyses
- Data cleaning: e.g., consolidating term variations
- Clustering: e.g., associating related topics, based on co-occurrence of terms in records (sometimes pursuing proximity analyses)
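As a concrete illustration of the data-cleaning aspect, term variants can be consolidated with simple fuzzy string matching. This is our own minimal sketch (using Python's standard difflib, with an invented term list and an invented similarity threshold), not VantagePoint's actual routine:

```python
from difflib import SequenceMatcher

# Hypothetical extracted noun phrases with variants to consolidate.
terms = ["supply chain", "supply chains", "supply chain management",
         "patent analysis", "patent analyses", "solar cell"]

def consolidate(terms, threshold=0.85):
    """Greedily merge terms whose similarity ratio exceeds threshold,
    keeping the shortest variant seen so far as the canonical form."""
    canonical = []
    mapping = {}
    for t in sorted(terms, key=len):        # shortest first -> canonical
        match = next((c for c in canonical
                      if SequenceMatcher(None, t, c).ratio() >= threshold),
                     None)
        if match is None:
            canonical.append(t)
            mapping[t] = t
        else:
            mapping[t] = match
    return mapping

merged = consolidate(terms)
# "supply chains" folds into "supply chain"; "supply chain management"
# is distinct enough to survive as its own concept.
print(merged)
```

A production routine would add stemming and domain thesauri, but the greedy merge-to-shortest-variant pattern is the core of such fuzzy-matching cleanup.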

Note that the thrust here is on relationships among terms (over the record set). Another approach goes the other way: associating records based on commonalities in how they use terms. These are complementary approaches, with distinct CTI payoffs. For instance, the record-oriented or document-oriented analytics can yield “patent landscapes” wherein one views concentrations of patent activity based on shared topical foci (Boyack et al., 2000).


For present purposes, we counterpose two approaches to analyze term associations: Topic Modeling and Term Clumping/Principal Components Analysis (PCA). We then vary the extent of human involvement in the latter. Both approaches have roots in Latent Semantic Analysis (LSA; Deerwester et al., 1990). LSA analyzed text to associate terms based on their co-occurrence, thereby going well beyond basic operations such as stemming (e.g., to associate “manager” and “managers” and, perhaps, “managing”).
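The co-occurrence idea behind LSA can be sketched on a toy term-document matrix; the corpus, the binary weighting, and the two-dimension truncation below are our own illustration, not the paper's data:

```python
import numpy as np

# Toy corpus: each "document" is a set of terms (order ignored).
docs = [
    ["manager", "innovation", "strategy"],
    ["managers", "innovation", "patent"],
    ["solar", "cell", "energy"],
    ["solar", "energy", "policy"],
]

# Binary term-by-document matrix.
vocab = sorted({t for d in docs for t in d})
X = np.array([[1.0 if t in d else 0.0 for d in docs] for t in vocab])

# SVD projects terms into a low-dimensional latent space; terms that
# co-occur across documents end up pointing in similar directions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
term_vecs = U[:, :2] * s[:2]            # keep the top 2 latent dimensions

def sim(a, b):
    va, vb = term_vecs[vocab.index(a)], term_vecs[vocab.index(b)]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

# "solar" associates strongly with "energy" (shared documents) but not
# with "innovation" (no shared documents), despite no shared stem.
print(sim("solar", "energy"), sim("solar", "innovation"))
```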


Topic Modeling is especially suited to text data since it is based on discrete counts. Resulting topics can be easily understood as probabilities. PCA factors can have positive and negative values. A bigger difference is that Topic Modeling yields a set of “T” topics to explain the entire text set (the corpus), whereas PCA computes the top “T” factors to account for the most variance. So, learning a Topic Model with twice as many topics will yield finer-grained topics, whereas computing double the number of PCA factors should leave the same top-T factors. Topic Modeling is also highly scalable, so can run on huge text sets.

Topic Modeling

Topic Modeling is a fully automated (not requiring human intervention) statistical method that “learns” topical structure in a collection of text documents (Blei et al., 2003; Griffiths and Steyvers, 2004). It simultaneously learns a set of topics to describe the entire collection, and the topics most descriptive of each document. Formally, each topic is a probability distribution over terms, and is typically displayed by listing the ten to twenty most likely terms. Topic Modeling has no need for an ontology, thesaurus or dictionary. Instead the topic model works directly and only from the text data by observing patterns of terms that tend to co-appear in documents (e.g., candidate and election, or warming and climate).

Topic Modeling counts word frequencies in each record, but ignores word order (proximity). Raw text is pre-processed to term units, usually single words. As performed here, we set the number of topics to be learned, T, to a desired value. Heuristics can guide the choice of T, e.g., based on corpus size, and prior belief about how many themes or subjects are discussed in the collection. For the PICMET data, we learned topic models using T=20 and T=40 topics. While both topic models produced meaningful and interpretable topics, the T=20 topic model was selected for later comparisons, to be more in line with the number of principal components used in the evaluation.

Term Clumping with PCA with Minimal Human Intervention

Term Clumping/PCA is our compilation of text cleaning and consolidation steps, followed by a statistical method to associate terms based on their tendency to co-occur in the records of the set under study. We are working to systematize these steps to extract terms, consolidate term variants, and then group them. We then perform our version of PCA in VantagePoint; PCA is a basic form of factor analysis. As mentioned, we perform two versions of Term Clumping/PCA: one with extensive human tuning, and one with minimal human involvement. Here is a summary of the semi-automated steps, working with the 5169 abstract record set.


a. Field Selection: The PICMET records do not contain index terms or keywords. The fields of interest to learn about topical emphases, thus, reduce to 1) title and 2) abstract words or phrases. Here we use VantagePoint’s Natural Language Processing (NLP) to extract noun phrases from the titles, and separately from the abstracts. We then merge those two fields to get 86014 noun phrases.

b. Basic Cleaning: The 86014 Title + Abstract phrases reduced to 76398 by use of VantagePoint’s general.fuz “fuzzy matching” routine. This consolidates terms with shared stems and other phrase variations expected to be highly related concepts. [While VantagePoint offers the option of human checking of the consolidated terms, these were accepted as is.]

c. Further Cleaning: In VantagePoint, we applied several thesauri without significant human intervention (“.the” files) to further consolidate term variations and cull “noise.” These include both very general and quite topic-specific collections:
   - stopwords.the: a standard thesaurus provided with the software that uses Regular Expression (RegEx) to batch some 280+ stemmed terms as “stopwords” [e.g., the, you, and, are, single letters, and numbers], reducing to 76105 terms
   - common.the: over 48,000 general scientific terms; when applied, this reduced to 73232 terms [http://www.nottingham.ac.uk/~alzsh3/acvocab/wordlists.htm]

d. Additional Cleaning: Ran VantagePoint’s list cleanup routine, again, using a variation of the base routine, “general-85cutoff-95fuzzywordmatch-1exact.fuz”. As the title hints, this was derived by varying parameters offered by the software to adjust fuzzy matching routines. This reduced the set to 69677 phrases. [Given that MOT terminology tends to be heavily “general,” there could be concern about losing some valuable information.]

e. Consolidation: Ran a macro devised by Cherie Courseault Trumbach and Douglas Porter of IISC to consolidate noun phrases of differing numbers of terms. As described by Trumbach and Payne [33], this concept-clumping algorithm first identifies a list of relevant noun phrases and then applies a rule-based algorithm for identifying synonymous terms based on shared words. It is intended for use with technical periodical abstract sets. Phrases with, first, 4 or more terms in common; then 3; then 2; are combined and named after the shortest phrase, or most prevalent phrase. In case of conflict, it associates a term with the most similar phrase. This reduced the set to 68062 terms.

f. Pruning: Removed phrases appearing in only 1 record, now down to 13089 (after removing 2 cumulative trash terms, “Stop words” and “Common,” from Step c). So, this is clearly the critical reduction, albeit an extremely simple one to execute.

g. Parent-Child Consolidation: Ran a macro devised by Webb Myers of IISC originally to consolidate junior authors under a more senior collaborator; here, to consolidate term phrases (macro = Combineauthornetworks.vpm). This reduces the set to 10513 terms.

To give a sense of term prevalence, here are some frequency benchmarks:
   - occurring in 100-475 of the records: 50 terms [excluding one blank term]
   - in 50-99 records: 62 additional terms
   - in 25-49 records: 186 terms
   - in 10-24 records: 763 terms

h. PCA: VantagePoint’s “factor mapping” [and/or matrix] routine that applies Principal Components Analysis (PCA) was then run on selected groups of these terms. In the past, we have found interesting results from a 3-tier PCA process, based on the percentage of records in which terms appear. We don’t have a set formula for this, but one could cut on the basis of % of records (e.g., 2%; 1%; and <0.5% to get at “fast-breaking” novel thrusts). Or, one could set thresholds based on record coverage by each set of terms collectively (e.g., the top N terms touch on 90% of the records; the next M terms touch on 20%; the bottom L terms include 3% of the records).

Our experience with PCA on ST&I datasets has often yielded fruitful results with term sets in the “low hundreds” (e.g., 100-300 terms) to be factor analyzed. In that the mission here was not to massage the process to get good results, but to test an automated process, we do not clean out any terms that have made it through Steps a-g. Nor do we try multiple cut-points. Rather we set the thresholds arbitrarily based on the record set (5169). We go for three PCA analyses:

   - High: On the top 112 terms (those appearing in 50-475 records) [collectively in 4567 of the 5169 records, or 88%]
   - Medium: On the next 186 terms (appearing in 25-49 records) [collectively, 83% of the records]
   - Low: Because there are so many terms (10513), we just go for a next tier, not for those appearing in few records; so the 3rd PCA treats the 763 terms in 10-24 records [collectively in 83% of the records]

i. PCA Results: Again, to minimize human tuning, we simply run PCA (factor map in VantagePoint) using the default number of factors to extract. We also accept the default cut-points to distinguish the high-loading terms that make up each factor (Principal Component).
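Steps f and h can be sketched in miniature as follows. This is a simplified stand-in using scikit-learn's PCA on a synthetic record-by-term matrix with invented, scaled-down thresholds; it is not VantagePoint's factor-mapping algorithm:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy binary record-by-term matrix: 200 "records" x 50 "terms",
# with per-term occurrence rates spanning rare to common.
n_records, n_terms = 200, 50
p = rng.uniform(0.005, 0.5, size=n_terms)
X = (rng.random((n_records, n_terms)) < p).astype(float)

term_freq = X.sum(axis=0)                 # records per term

# Step f (pruning): drop terms appearing in at most 1 record.
keep = term_freq > 1
X, term_freq = X[:, keep], term_freq[keep]

# Step h (tiering): split the surviving terms into frequency bands and
# run a separate PCA on each band (thresholds here are illustrative).
bands = {
    "high":   term_freq >= 50,
    "medium": (term_freq >= 20) & (term_freq < 50),
    "low":    (term_freq >= 5) & (term_freq < 20),
}
factors = {}
for name, mask in bands.items():
    if mask.sum() >= 2:
        # components_ rows play the role of "factors"; the high-loading
        # terms on each row would define that factor's theme.
        pca = PCA(n_components=min(3, int(mask.sum()))).fit(X[:, mask])
        factors[name] = pca.components_

for name, comps in factors.items():
    print(name, comps.shape)
```

The design point mirrors the paper's: the pruning step does the heavy reduction cheaply, and the tiered PCA then looks for broad themes among common terms and specialized themes among rarer ones.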

Results are encouraging. Having glanced at the terms ahead of time, we were concerned in that there were a number that we would have liked to have cleaned out, e.g., “technology” or “department” (too general); or “st Century” or “i e” (noise). We also deferred to the automated factor naming: simply the highest loading term.

Upon scanning the three semi-automated PCA results [those that were rated most highly, from whatever source, are included in Appendix Table A as h#, m# or l#; full sets available on request], the factors look pretty coherent and relevant to MOT. Some of them appear useless, but not bizarre; e.g., one High factor includes three terms: U S, department, and United States. A Medium factor includes USA and UK (not so interesting, but presumably mentioning both countries in some comparison). These terms also suggest that further cleaning would have been good, i.e., to combine U S, USA, and United States.


Upon scanning the factors and their terms, one sees room for human enhancement. We might rename the Low Factor, “fossil fuel,” to “energy” to get the gist of the theme that these leading terms suggest. Some factors do not appear worth attention (e.g., a Medium factor of “USA”). Some terms don’t add much intelligence value (e.g., “relevant literature” on the Low Factor, “manufacturing organization”). Several factors just combine terms that address the same thing, e.g., the Low Factors, “RFID” and “New Product Development NPD.” Some make you think, e.g., the Low Factor, “scarce resource,” about a common theme. A few terms associate with more than one factor.

The intent of this tri-factor PCA is to provide insight into both broad and detailed sub-themes within the topic domain. The VantagePoint factor mapping (PCA) process sets thresholds on the number of terms associated with a given factor to improve clarity [statistically, every term analyzed has some degree of positive or negative association with each factor]. This is based on an algorithm that considers the relative gap size between ordered terms’ factor loadings, the number of terms, and the absolute loadings. The resulting factors are selective; they do not cover as wide a swath of terms as would the entire term set from which they are drawn. These Low Factors cover 14% of the records; the Medium, 19%; and the High, 53%.




For present purposes of observing whether the minimal intervention Term Clumping/PCA analyses can yield informative results, we don’t cull or edit the factors, but present some tabulations. We include the 7 High, 9 Medium, and 16 Low Factors. In terms of record coverage, the High Factors cluster at the top, but the Medium intrude somewhat. The Low Factors cluster at the bottom; again, with some overlap with the Medium Factors.

In another paper (Porter et al., 2012), we apply the factors to study MOT activity. Notable changes in emphasis over time emerge, including:

- High Factors (general): Increased attention to uncertainty; in contrast, emphasis on new product development has remained relatively steady over time.
- Medium Factors: Interest in converging technology has varied extremely, with strong peaks in 2004 and 2007; continual improvement has been very hot, but seems to drop off this year; conversely, ICT seems of special interest this year; focus on DEA (Data Envelopment Analysis) fluctuates wildly.
- Low Factors (addressing more specialized foci): Attention on customer demand leaps up this year; fossil fuel (energy) jumps up in 2011/2012.

We also compared the factors by geographical region, spotting such differences as: organizational topics draw less East Asian attention, but manufacturing industry is of high interest; DEA engages fewer Europeans, but they are drawn to organizational issues; fossil fuels are of most concern to North American MOT researchers.

Term Clumping with PCA with Considerable Human Tuning

Porter et al. (2012) prepared an analysis of the PICMET papers’ content for presentation at PICMET-2012. As in the Term Clumping/PCA with minimal human intervention, the topical analysis focused on the title and abstract noun phrases, combined. However, in this case we drew heavily upon human interpretation. We “borrowed” (i.e., imported lists of) select terms from a prior analysis (Porter et al., 2005). Using VantagePoint’s “MyKeywords” function we created a group of highly MOT-relevant terms and tracked the records in which those appear. We then scanned the current list of title and abstract NLP phrases occurring in 5 or more of the 2011-2012 abstracts to capture possible new research interests in the field. We combined the old and new MOT terms to get 597 noun phrases. We then re-cleaned this list to consolidate to 522 phrases as the topics for further analyses.

We divided the 522 phrases into three groups based on the number of records in which they appear and ran PCA on each. Upon review of the resulting PCA factors, we determined that further data consolidation (e.g., to combine terms like Data Envelopment Analysis with their acronym, DEA) would help. This reduced the MOT term set to 435 phrases, upon which we reran three PCA analyses.

Table 1 shows the factors that emerged for the Medium and Low frequency MOT term sets. They, respectively, only encompass a small percentage of the 5169 PICMET records (11% for the Medium set; 7% for the Low set). So, for present aims of observing large scale topical emphases and trends, we focus on the High set that covered some 90% of the records. One might pursue these more specialized factors to study emerging MOT topics.





Table 1. Medium and Low Frequency Term Factors

Medium Factors: commerce; alliances; TQM; mergers; nanotechnology; social capital

Low Factors: integrative modeling; managing information; Australia; ethanol; technical problem solving; graduate course; BPR; know how; non linearity; energy consumption; organizational changes


The High occurring MOT terms set yielded 10 factors (principal components). Two of those centered on “innovation,” a term occurring in 1185 of the 5169 records. Those two factors’ top terms are included in 42% and 33% of the records. To sharpen contrasts, we reran without the term, “innovation”; examined the results; requested additional factors; and settled on a set of 14 factors. The MOT terms/phrases were then pruned a bit more to remove terms that related (i.e., co-occurred significantly with the other terms loading on that factor), but were inherently very broad. For instance, “Japanese” and “R&D” loaded with “patents and intellectual property,” but would reflect quite varied MOT research interests; “growth” related to “global issues,” but can be used in so many ways. This reduced the MOT terms included on the 14 factors from 53 to 44.

How do the factors compare with those generated by a similar process with minimal human intervention? They are very different! Only 1 of the above 14 human-induced factors has a clear ‘twin’ in the High, Medium and Low PCA factors generated with minimal human intervention. 4 of 14 have a counterpart, but not really the same composition; 2 more find a factor with some intersection, but quite different. And 7 of 14 find no counterpart. So, analysis of the MOT field based on the human-conformed factors is quite different from that based on the semi-automated PCA analyses.

Human Assessment

We consolidated the various term/phrase sets from these analyses of the PICMET abstract records into a list of 66 term groups: 20 topics generated by the Topic Model; 14 human-assisted PCA factors; and 7 High, 9 Medium, and 16 Low frequency semi-automated PCA factors. We presented these via an e-mail survey to 6 colleagues with extensive knowledge of MOT and familiarity with PICMET. We asked for their quick judgment using a 0-1-2 scale on whether the term group gets at important thrusts in the field. The Appendix shows the email, to which was attached an Excel file containing the 66 groups with a blank column in which to insert their rating.



All responded. Two noted concerns, pasted in the Appendix, but rated the term sets as requested. Inter-rater reliability averaged 0.35. The 15 pairwise comparisons among the 6 raters ranged from 0.18 to 0.52. No rater stood apart as either less or more closely aligned with the others. Our “ideal” result would have been considerably higher inter-rater agreement, though we had no expectation that such general terminology and complex factors would yield correlations approaching 1.00. The results suggest that there is a meaningful sense of how these term sets pertain to MOT research, but considerable noise in the assessment process. We, thus, averaged their ratings to order the term sets in terms of importance to MOT research assessment.


Appendix Table A presents the top ordered ratings of the 66 term groups [truncated because of page limit]. At the top, 5 groups rate 2.0; 6 rate 1.83; 17 rate 1.67; another 5 rate 1.5. Half of the 66 rate at least 1.5. From the bottom, 3 term groups rate 0; and another 10 also rate less than 1.

Table 2 spotlights the assessment of the term groups (some individual term group ratings and coverage appear in Appendix Table A). Our hypothesis was that the human-assisted Term Clumping/PCA factors would outperform the more automated approaches; they don’t. This is surprising and encouraging to the prospects for text mining of valuable ST&I intelligence. One can peruse a) mean rating, b) number of highly rated term sets, and/or c) percentage of the term sets rated highly. Results correspond well. The Low frequency PCA sets show promise at getting at important MOT topics. The Topic Models show, perhaps, the most promise. This is compelling in that the present Topic Models derive from single-word analyses, whereas the PCA results are based on NLP noun phrases.



Table 2. Comparing Human Assessment on the Term Clustering Approaches

Type | # of Term Groups | Mean Rating | # in Top 28 (Rating >=1.67) | % in Top 28 (Rating >=1.67)
Topic Models | 20 | 1.45 | 10 | 50%
Human-Assisted Term Clumping/PCA | 14 | 1.39 | 6 | 43%
Semi-automated Term Clumping/PCA, High frequency | 7 | 1.00 | 1 | 14%
Semi-automated Term Clumping/PCA, Medium frequency | 9 | 1.19 | 3 | 33%
Semi-automated Term Clumping/PCA, Low frequency | 16 | 1.28 | 8 | 50%





Discussion and Conclusions

This paper focuses on ways to systematize Tech Mining. It emphasizes empirical analyses of ST&I text resources, with an eye toward automating as much of that work as makes sense. Note that we see huge value in incorporating human expertise (ideally, multiple perspectives) together with empirical findings to generate valid and useful intelligence (i.e., CTI and FIP). However, this paper concentrates on the empirical text mining.

Rating concerns (Appendix) are important. The Topic Modeling and Term Clumping/PCA approaches are not identical. The former uses unigram tokens while the latter used NLP-extracted noun phrases. We recognized the issue of how to deal with countries, but did not remove these as they could have been topically significant. Likewise, we did not separate out methodological terms for special treatment. Some of these stood out as sharp term sets (e.g., DEA). One could fairly decide in advance to apply special thesauri to remove such types of terms if they were deemed not to be of prime interest.

References

Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Boyack, K.W., Wylie, B.N., Davidson, G.S., and Johnson, D.K. (2000). Analysis of Patent Databases Using VxInsight, Sandia National Laboratories, Albuquerque, NM, USA.

Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41, 391-407.

Griffiths, T., and Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235.

Newman, N.C., Porter, A.L., Newman, D., Courseault, C., and Bolan, S.D. (to appear, 2012). Comparing Methods to Extract Technical Content for Technological Intelligence, Portland International Conference on Management of Engineering and Technology (PICMET).

Porter, A.L., and Cunningham, S.W. (2005). Tech Mining: Exploiting New Technologies for Competitive Advantage. New York: Wiley.

Porter, A.L., Guo, Y., Huang, L., and Robinson, D.K.R. (2010). Forecasting Innovation Pathways: The Case of Nano-enhanced Solar Cells, ITICTI International Conference on Technological Innovation and Competitive Technical Intelligence, Beijing, December.

Porter, A.L., Schoeneck, D.J., and Anderson, T.R. (2012). PICMET Empirically: Tracking 14 Management of Technology Topics, Portland International Conference on Management of Engineering and Technology, Vancouver, BC.

Porter, A.L., Watts, R.J., and Anderson, T.R. (2005). Mining PICMET: 1997-2005 Papers Help You Track Management of Technology Developments, Portland International Conference on Management of Engineering and Technology (PICMET), Portland, OR.

Porter, A.L., Zhang, Y., and Newman, N.C. (to appear, May, 2012). Tech Mining to identify topical emergence in Management of Technology, International Conference on Innovative Methods for Innovation Management and Policy (IM2012), Beijing.

Robinson, D.K.R., Huang, L., Guo, Y., and Porter, A.L. (to appear). Forecasting Innovation Pathways for New and Emerging Science & Technologies, Technological Forecasting & Social Change.





Appendix: The e-mail survey

Dear

We are comparing different techniques for extracting topic information from text. Could you please help us assess what we’re finding for PICMET?

We have extracted topics from ~5000 PICMET titles and abstracts (1997-2012) using three different methods. In the attached Excel sheet we have combined the topics of the three methods (and randomized the order of rows).

Could you please score each topic on a 0-2 scale (0=not interesting, 1=in between, 2=important). No “quota,” but think of about 1/3 receiving each value.

For scoring, consider the following:
- Does the topic get at important thrusts in the field?
- Might the topic be useful for analyses of Management of Technology (MOT) research emphases over time?
- Do not focus much on the specific wording, but rather on whether the terms suggest or indicate an important MOT concept
- If two topics overlap, don’t try to pick the best one, but judge each’s value on its own (e.g., if “decision” concepts are vital to MOT and two topics both get at decisions, each could potentially rate a “2”, or not)

We need results by SATURDAY, please, so hoping for very quick reaction judgments (do it all in <5 minutes!).

Thanks much!

Alan


---------


Rating issues expressed

in responding to the survey
:

I found myself considering some additional factors to your list.

1.

Are the lists mostly methodological in character?


In this instance I personally down
-
weighted the
importa
nce.

Methods are important, but probably not MOT specific nor conceptual in character.

2. Are the lists partly strategic? In this instance I often down-weighted. This is because there are two inter-related questions here: Can tech mining identify important thrusts? Should tech mining be used by professionals in this instance to keep abreast of change? Strategic issues are important, but tech mining may not be the best available tool to support monitoring.


The topics are confusing. Some include methodology, concept and country. It is not clear which is to be scored.

Most of the topics are important; only a few are “not interesting” because they just list a country, and a few are in-between. I was not able to divide the three categories equally.


---------------------------------------------


Appendix Table A. Ratings of the 66 Term Groups

Ratings: 2 = important to analyze MOT research; 1 = in between; 0 = not interesting
Groups: Topic Models [#]; Human-assisted PCA [a#]; Semi-automated PCA [h#; m#; l#]
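Each rating in Table A is the mean of six raters' scores on the survey's 0-2 scale. A minimal sketch of that aggregation follows; the topic labels and the individual rater scores shown are illustrative assumptions, not the actual survey responses.

```python
# Sketch of the "mean of 6" rating aggregation reported in Appendix Table A.
# Assumption: six raters each score a topic 0 (not interesting),
# 1 (in between), or 2 (important). Scores below are illustrative only.
from statistics import mean

ratings = {
    "a2 dynamics": [2, 2, 2, 2, 2, 2],         # unanimous "important"
    "m7 open innovation": [2, 2, 2, 1, 2, 1],  # mixed judgments
}

# Mean rating per topic, rounded to two decimals as in the table
mean_ratings = {topic: round(mean(scores), 2) for topic, scores in ratings.items()}

# Topics are then listed in descending order of mean rating
for topic, r in sorted(mean_ratings.items(), key=lambda kv: -kv[1]):
    print(f"{r:.2f}  {topic}")
```

A unanimous topic yields 2.00 and a 2,2,2,1,2,1 split yields 1.67, matching the rating values that appear in the table.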

Rating (mean of 6) | % of records [of 5169] | Group | Factor Labels | Topical Term/Phrase Group
2.00 | -- | 6 | -- | emerging technologies, science, development, commercialization, challenge, change, future, field, nanotechnology
2.00 | -- | 13 | -- | patent analysis, patent data, patent application, analysis, data, technological, field, information, intellectual, indicator
2.00 | 11.8 | a2 | dynamics | dynamics, diffusion, evolution, forecasting
2.00 | 0.9 | l11 | national R&D program | R&D program, national R&D program, R&D performance, evaluation system
2.00 | 0.5 | l8 | intellectual property IP | IP, intellectual asset, intellectual Property IP




1.83 | -- | 1 | -- | innovation capabilities, innovation model, innovation process, firm, technological capabilities, knowledge, competitive, business model, business process, strategic
1.83 | -- | 8 | -- | product development, product market, product design, product process, development process, market, industry, manufacturing companies, customer, companies
1.83 | -- | 16 | -- | energy technologies, power, environmental, fuel cell, sustainable development, green, development, solar cell, technologies
1.83 | 22.2 | a5 | global issues | economy, sustainability, developing countries, world, globalization
1.83 | 21.3 | a9 | knowledge mgt | knowledge, knowledge mgt, information technology
1.83 | 0.5 | l14 | RFID | RFID, radio frequency identification RFID, RFID technology
1.67 | 1.0 | l1 | bibliometric analysis | system dynamics, bibliometric, system dynamic model, bibliometric analysis
1.67 | -- | 2 | -- | industry cluster, developing countries, economic policy, development, government, innovation cluster, international
1.67 | -- | 3 | -- | project management, project success, project manager, management, success factor, organization, manager, risk factor, critical
1.67 | -- | 10 | -- | companies, business strategy, company, management, strategy, SME, enterprises
1.67 | -- | 14 | -- | knowledge management, knowledge transfer, management, program, learning, engineering, education, student, academic, transfer
1.67 | -- | 15 | -- | team member, organizational culture, social, organization, culture, member, individual, cultural, employees, relationship
1.67 | 21.5 | a3 | education | education, students, university, engineering, learning
1.67 | 11.0 | a6 | government | government, technology transfer
1.67 | 7.2 | a8 | IP (Intellectual Property) | patent, IP
1.67 | 8.0 | h3 | new product | new product, innovation process, product developer, developing new products
1.67 | 0.7 | l12 | New Product Development NPD | New Product Development NPD, NPD process
1.67 | 0.7 | l4 | ERP | integrated framework, ERP, managing resources
1.67 | 1.2 | l5 | federal government | managing change, corporate culture, technology diffusion, shared vision, federal government
1.67 | 1.2 | l6 | fossil fuel | renewable energy, fossil fuel, global warming, biofuel, natural gas
1.67 | 3.5 | m2 | converging technology | converging, 21st Century, Radical Innovation, biotechnology, converging technology, policy implication
1.67 | 1.0 | m4 | ICT | ICT, communication technology ICT
1.67 | 3.3 | m7 | open innovation | key factor, innovation activity, open innovation, innovation capability, innovation strategy
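The "% of records" column gives the share of the 5,169 PICMET records in which a group's terms occur. A hypothetical sketch of that computation follows; the mini record set and the case-insensitive substring-matching rule are illustrative assumptions, not the paper's actual VantagePoint procedure.

```python
# Sketch: percent of records containing any term from a term group.
# Assumption: a record "contains" a group if any of the group's terms
# appears in its title/abstract text (case-insensitive). Records here
# are illustrative, not drawn from the PICMET data set.
records = [
    "RFID technology adoption in supply chains",
    "Open innovation strategy in SMEs",
    "Knowledge management and technology transfer",
]
group_terms = ["rfid", "radio frequency identification"]  # cf. group l14

hits = sum(any(term in rec.lower() for term in group_terms) for rec in records)
pct = 100.0 * hits / len(records)
print(f"{hits} of {len(records)} records ({pct:.1f}%)")
```

Dividing hit counts by the full record total (5,169 in the study) yields the small percentages seen for narrow groups such as RFID (0.5%).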





1 See www.theVantagePoint.com. Development of VantagePoint was supported by the Defense Advanced Research Projects Agency (DARPA), the U.S. Army Tank-automotive and Armaments Command, and the U.S. Army Aviation and Missile Command under Contract DAAH01-96-C-R169.