
Nov 15, 2013

>> Christian Bird: Today we have the opportunity to have Emily Hill come in and visit us for a day and give us a talk. She did her PhD at Delaware with Lori Pollock, and for the past year, almost year and a half, she has been an associate professor at Montclair State, and she is here visiting us today to talk about her work on natural language programming and software engineering. Take it away.


>> Emily Hill: All right. In general my research is motivated by the problem where you've got this huge source code base and someone's got to maintain it, and the poor maintenance developer needs to somehow identify the code that they are looking for. There are a couple of steps that they take in trying to locate that code. If they don't have an expert available to tell them where to look, then they have to do something else. One way to locate relevant methods and fields is by searching the source code, trying to find the regions of the code that might be relevant, and then further exploring those areas to refine their understanding and really see what else is relevant to the exact task they are trying to solve.


Today what I am going to talk about is how we can use the natural language in the source code, the words in the comments and the identifiers, to help the developer search, explore, and understand their code more effectively. In fact, research has shown that developers spend more time finding and understanding code than fixing bugs, so we can help reduce the high cost of software maintenance if we can speed up this process. So what are the current approaches that developers typically use in addressing these issues?

Well, there are a wide variety of navigation and exploration tools, and those are commonly built into IDEs. Using the program structure, like the AST, the call graph, and the type hierarchy, they allow the developer to jump to related source code. These are techniques that developers use all the time, and they are great.

They take advantage of the program structure, but sometimes they can be predominantly manual and slow for very large and scattered code bases, because each navigation step has to be initiated, and if your task takes multiple steps, every time you are locating a new piece of code you are initiating navigation step after navigation step to navigate that program structure. So what is an alternative?

Well, there are search tools, which work similarly to how we search the internet using either Google or Bing, and they apply string matching over these comments and identifiers. They do allow you to locate large and scattered code, but they tend to have a problem with returning many irrelevant results and missing a lot of relevant ones, because if the developer enters a query that doesn't match the words the original developer used in the source code, then the search results are not going to return anything relevant.


So both tools have strengths, but both also have challenges. So how can we go about improving these software maintenance tools to help facilitate software maintenance? Our observation is that programmers express concepts when writing code. They use the program structure, the if statements, the method calls, the algorithmic steps, the order they organize the statements within their code, but also the natural language, the words in the comments and the identifiers.

So our approach is to leverage both of these sources of information to try to build more effective software engineering tools, and our specific target is software maintenance. So let me give you an example of combining program structure and natural language information together.


Let's say we have an auction sniping program. It will allow us to automatically bid on an eBay auction online, and we are looking for the code that implements adding an auction, so the user is going to add an auction to the system, and I happen to know from prior experience with the system that DoAction is the method that handles all user-triggered events. If I am just using program structure, I can see that DoAction calls 40 methods. That is not terrible, but only two of those 40 are relevant, so going through that list of 40 is a poor use of the developer's time. If I use natural language alone and search the entire code base, I get 90 matches across about 50 methods, and I locate the relevant two, but I also locate tons of irrelevant ones. But if we combine this information and put it together, we can locate the two relevant ones with just one false positive, narrowing our list of 40 or 50 methods down to just three for the developer to look through.


So we wanted to try to combine program structure and natural language to help us improve tools and get better information. Uh-huh, oh yes, please, feel free to interrupt.


>>: Was that an intersection, the programming language answers and the natural language answers, to get to…


>> Emily Hill: Yes. Basically we used search techniques of the natural language on the program structure, so the subset: we only searched the 40 callees of DoAction. Good question.
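The combination described here, string matching restricted to DoAction's callees rather than the whole code base, can be sketched roughly as follows. The call graph and the method names are invented for illustration; this is not the actual tool.

```python
# Hypothetical sketch: combine program structure (a call graph) with
# natural-language matching over identifier words.

def split_identifier(name):
    """Split a camelCase identifier into lowercase words."""
    words, current = [], ""
    for ch in name:
        if ch.isupper() and current:
            words.append(current.lower())
            current = ch
        else:
            current += ch
    if current:
        words.append(current.lower())
    return words

def search_callees(call_graph, entry, query_words):
    """Return only the callees of `entry` whose names contain every
    query word, instead of string-matching the entire code base."""
    query = set(query_words)
    return [m for m in call_graph.get(entry, [])
            if query <= set(split_identifier(m))]

# Invented call graph: DoAction calls many methods, few are relevant.
call_graph = {"DoAction": ["addAuction", "addAuctionEntry", "paintBorder",
                           "cancelAuction", "updateTitle"]}
print(search_callees(call_graph, "DoAction", ["add", "auction"]))
```

Restricting the word match to the structurally reachable methods is what shrinks the candidate list from the whole code base to a handful of callees.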


>>: When you say natural language alone, are you talking about static analysis at all, or are you just saying just the comment sections?


>> Emily Hill: So comments and identifiers, so any text that shows up.


>>: But any sort of syntactical analysis, are you doing that also, when you say natural language?


>> Emily Hill: Usually I mean bag of words at the base level, although we have been working to build more semantic and syntactic analysis on top of that. I will actually show you what I mean by that [laughter] down the road. But strictly natural language information, what it boils down to, is somehow using the words, whether that is just a straight list of the set of words in a method, or something more advanced than that. And actually, thank you, that leads me right to my next point, which is that when using this natural language information and combining it with program structure, it is not enough to use the words alone, independently. The context in which the words appear is very important. For example, say we have three occurrences of the word map. We might have map object in a method name, where map is playing the role of a verb or an action, versus object map, which is like a hash map that contains objects, so that is really its noun sense; and then we might have the words map and object on two completely unrelated statements in the method, but the word map shows up. So the context in which a query word appears is very important in improving our accuracy for the search, as is the location of the word. For example, a method signature is typically a better summary or abstraction of what a method is doing than a random word just anywhere in the method body, and so we try to leverage that information to help improve accuracy as well.


So let me show you an example of why using context and location is so important. For example, I like adding things, so say we are searching for add item in a piece of shopping cart software. On the left I have a method add entry, and on the right I have a method called sum. Both are different senses of the word add. When I talk about context, I am talking about going from lexical concepts, the individual word itself, commonly referred to as the bag of words approach in information retrieval, to phrasal concepts. If we look at just straight word occurrences, both of these methods contain the words add and item; both match equally. But if we evolve that to phrasal concepts, concepts that consist of multiple words, we can see that the left-hand side, add entry, actually is adding an item, whereas sum is adding a price. So by taking advantage of these phrasal concepts, we can better identify the relevant add entry method over the irrelevant sum.
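The distinction between lexical and phrasal matching can be sketched like this. The word lists and the extracted action/theme pairs below are invented stand-ins for the two methods, not output of the real tool.

```python
# A bag-of-words match cannot tell these two methods apart,
# but a phrasal (action/theme) match can.

def bag_match(query, words):
    """Bag of words: does every query word occur anywhere in the method?"""
    return set(query) <= set(words)

def phrasal_match(query_action, query_theme, phrase):
    """Phrasal: does the method's action/theme pair match the query's?"""
    action, theme = phrase
    return query_action == action and query_theme in theme

# Words appearing anywhere in each method (both contain add and item).
add_entry_words = ["add", "item", "entry", "list"]
sum_words = ["add", "price", "item", "total"]

# Phrasal concepts extracted from each method: (action, theme).
add_entry_phrase = ("add", "item entry")
sum_phrase = ("add", "price")

print(bag_match(["add", "item"], add_entry_words))     # both methods match
print(bag_match(["add", "item"], sum_words))
print(phrasal_match("add", "item", add_entry_phrase))  # only add entry matches
print(phrasal_match("add", "item", sum_phrase))
```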


In addition, location further helps us. The phrasal concept in the signature for add entry is add item entry, which again contains the query words add item, whereas the signature in sum is simply sum, and we could put in a direct object there if we wanted. But looking at the location, the method signature versus the body, further helps us figure out what the topic of this method is, what action it is really trying to take. So I'm going to talk a little bit about our work on query reformulation, which is where we help the developer select query words and determine result relevance. That was the motivation for our next step, which was developing a model of word usage in software that actually captures these phrasal concepts in a general way that is usable by software engineering tools beyond just software search. And then, how does that model with phrasal concepts help us improve software search? How can we take advantage of natural language and program structure in program exploration, and then, if we combine these two pieces together, how much improvement can that lead to?


Any questions before I change topics? Well, not really changing topics, just changing problems slightly. With query reformulation we are concerned with helping the developer pick the right query words to help maximize the results of their search, and determine if the results are relevant or not. When developers search source code, they typically start off with a query that is executed by some search method on the source code base, and then those results are returned to the developer. I am sure you have all experienced this before, whether it is on the web or on a source code base. And if those results are relevant, the developer can stop their search, and if they are not what they are looking for, they can continue to repeat this process until they either get relevant results or they get so frustrated that they stop and walk away and use some other means to locate the code that they are looking for.


In this process the developer faces two key challenges: first, in deciding what query words they actually have to search for, and second, in determining whether or not those results are relevant. And I am going to go into detail as to why those are so challenging. So first, why is selecting a query difficult? When we are searching software, we have to guess what words the original developer used to implement the concept, and actually research has shown that when two people try to select words to describe a familiar concept, they only agree about 10 to 15% of the time. So this is a really, really common problem, not just in code search, but in searching in general.


>>: Is that disagreement without talking to each other, or after?


>> Emily Hill: Without talking to each other. So two people trying to describe, like maybe they saw a picture and they are trying to pick words that describe it.


>>: And are those developers or just general people?


>> Emily Hill: I think that the target was developers for this case, although I would have to double check that; don't hold me to that. It may be a more general information retrieval research result, because I don't think that that study has been done for developers. Although Biggerstaff has done some work on how difficult it is to describe those concepts.

So there are three major challenges in selecting query words. First, you can have multiple words with the same meaning, so you might formulate the query delete, but that concept is implemented as remove, rm, or del as an abbreviation. Then you might have a single word that has multiple meanings, so add, as we saw in our prior example, can mean either appending or adding to a list versus summing, and those are two different senses, and you are going to get irrelevant results if you use that one general word. And even if you pick exactly the right word to describe the concept you are looking for, say, going back to our auction sniping program example, that you want the code that implements the data storing the auctions in the system: auction is clearly the right word, but it is an auction sniping program. The word auction is going to appear everywhere throughout the code, so it is not a good discriminator. So even if the word is perfectly correct and accurate, if it is too frequently occurring, it is not going to be specific enough to get you the results you are looking for.
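One standard way to quantify why a ubiquitous word like auction is a poor discriminator is inverse document frequency from information retrieval. This is an illustrative sketch with an invented corpus of method word sets, not the talk's actual scoring.

```python
# idf = log(N / df): high weight for rare words,
# near zero for words that appear in every "document" (method).
import math

def idf(word, documents):
    """Inverse document frequency of `word` over a list of word sets."""
    df = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / df) if df else 0.0

# Invented auction-sniper methods: every one mentions "auction".
methods = [{"auction", "add"}, {"auction", "cancel"}, {"auction", "paint"},
           {"auction", "snipe"}, {"auction", "bid", "place"}]

print(round(idf("auction", methods), 3))  # appears everywhere: weight 0
print(round(idf("snipe", methods), 3))    # rare: high weight
```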


So all three of these challenges really conspire to make it difficult to come up with a good query, and it really makes it difficult for the search tool to try to suit any arbitrary query anyone could think up, all of the time. So it actually becomes very challenging. So that is the first challenge we are trying to address.


The second challenge: why is determining result relevance so difficult? Well, in a typical IDE, when we do a keyword search or a regular expression search, the results are typically presented just in a list, and we have to read through the code to find out whether the results are relevant or not. Think about when we search the web: it is very easy to pull out where our query words appear and what their context is. The query words are boldfaced in the context of the sentences where they are used. The titles of the web pages are nice and in a big old font; they're bigger; they are in a different color. So when I enter a query into something like Bing or Google, I can quickly see, ah, was my query even right, before I even go looking to answer whatever question I have, the reason I made the search; I can quickly figure out if my query is even in the ballpark. But with source code, the developer could actually waste time trying to understand code that is not even relevant, and to me that is the biggest crime: understanding code is hard enough without having to understand code just to figure out whether your query was getting you the right source code. Uh-huh?


>>: So couldn't you do something similar to what Google does? They highlight the relevant words and they show it in some context, right? Is there a reason why you couldn't do that, or is that where you're going?


>> Emily Hill: That is kind of where I'm going, and actually you could take that idea further. I have only pushed a little bit, in that we are going to use phrases to embed those query words and give that context, because it is easier to read a natural language phrase than some source code. But you could even go further and highlight even more. Uh-huh?


>>: Are there studies to show how long this takes, to scan through a list of search results like the old style?


>> Emily Hill: I don't think so. And I think it's highly dependent on how expert you are and how familiar you are with the code system. We normally assume that the person searching has very little familiarity with the code system, and so they are going to take the longest. If you know the code base, you're probably going to get pretty good at filtering out irrelevant results, but a newcomer to a system who is really unfamiliar is probably going to have to read each result, and it depends on how fast you read, how quick you are. But no, I am not aware of any studies that have evaluated that.


>>: [inaudible] I don't know if it is in any of their papers, but it can take like multiple minutes per result, and people typically give up after five or ten; if they seem irrelevant going down the list, they might not even go down the list.


>>: Like Google, you give up after the first three, or one.


>> Emily Hill: Yeah, 5 to 10.


>>: I mean they don't scroll.


>> Emily Hill: Yeah, I think for information retrieval in general, the average is 5 to 10. They will look at the first 5 or 10 results, and if they don't see it they will give up. But that code list is just alphabetical in most cases, and so if it doesn't show up right away in your alphabetical listing of file names, you might have the right query and just not know it. If we could really get the developer to figure out whether their query is even right, and then hone in on the correct results, that is our goal. So the problem is that search results in general can't be quickly scanned, which we are going to try to change, and the results are poorly organized, so you have to decide the relevance of each result. You have 50 matches; you might look at the first 5 to 10. If you are really, really exhaustive, you might look at all of them, but you have to keep determining the relevance, making that decision for each search result. So we would like to change that. So our key insight is that the context of the query words in the source code is going to enable skimming and organization of results, and provide faster feedback for poor queries. We don't claim that we can automatically correct the developer's query; only they understand their information need. But if we can give them that feedback faster, so they can more quickly change their query, that is a win for us. So we are going to automatically capture context by generating phrases from the source code.


For example, if I had the signature add item, I could generate a phrase add new book item, for example. Or update event, compare playlist file to object, or load history. So if our query is load, we can quickly see that this result is loading history versus loading a file, downloading a file, or delivering a payload; just seeing how the query word appears with other words in the signature will help us make that determination more quickly. And we try to make it a faster read, because usually humans can read natural language sentences faster than source code.


We are going to organize these phrases into a hierarchy, and I will show you an example of how we do that. If we take an example task, say we are searching a JavaScript interpreter for signed integer conversion. Our query is to int, and there are 30 results for to int, which I have listed to the right. We could look through this entire list, or we could use our phrases to try to group the results together, and we might be able to hone in on the relevant results faster. In the phrase hierarchy, at the very top is the query, to int, with its 30 results, and below that we have three sub-phrases: add value to code int 16, object to int map, and to int 32. And since signed integer conversion involves the 32-bit conversion, to int 32 is the sub-phrase that we are really interested in. That is where we think we will find our relevant results.


We are able to discard about 27 of the other 30 results, and the context of the query words in the phrase helps us to determine the relevance more quickly. So we are reducing the number of relevance decisions from 30 results down to just three phrases, and then three results to verify that those three results are the correct relevant signatures.


>>: [inaudible] generate phrases?


>> Emily Hill: Yes. So we automatically generate those phrases from the signatures and use partial opportunistic phrase matching that greedily groups them together into a hierarchy. Uh-huh?


>>: Are those phrases every identifier from the signature of the method that you are considering, or is it just a subset?


>> Emily Hill: We usually generate them for the entire signature, and I think in this iteration we also generated it for the parameters too. But we tried to match the longest sub-phrase, so that usually prevented us from grouping based on, like, formal parameter names, unless that provided the largest grouping. Does that make sense?


>>: Partially, but just keep going.


>> Emily Hill: Okay. Yes, we do parse the parameters, but usually the results are grouped based on their signatures. For example, for to int, you notice we can split up to and int; they don't have to be right next to each other, like add value to code int 16, things like that.
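A rough sketch of this kind of grouping: each result phrase is keyed by a query-containing subphrase, a simplified stand-in for the partial opportunistic phrase matching described above. The generated phrases below are invented.

```python
# Group search-result phrases under shared subphrases that contain
# the query words (here "to int"), to form a small hierarchy.
from collections import defaultdict

def context_key(query, words):
    """Smallest contiguous window containing every query word,
    extended by one word of trailing context."""
    best = None
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            window = words[i:j]
            if all(q in window for q in query):
                if best is None or len(window) < len(best[0]):
                    best = (window, j)
    window, end = best
    if end < len(words):              # keep one word of following context
        window = window + [words[end]]
    return " ".join(window)

def build_hierarchy(query, phrases):
    """Group each phrase under its query-containing subphrase."""
    groups = defaultdict(list)
    for words in phrases:
        groups[context_key(query, words)].append(" ".join(words))
    return dict(groups)

phrases = [["to", "int", "32"],
           ["convert", "value", "to", "int", "32"],
           ["object", "to", "int", "map"],
           ["add", "value", "to", "code", "int", "16"]]
hierarchy = build_hierarchy(["to", "int"], phrases)
for subphrase in sorted(hierarchy):
    print(subphrase, "->", hierarchy[subphrase])
```

Note that the query words need not be adjacent: add value to code int 16 groups under the subphrase to code int 16, mirroring the split-up matching described in the answer above.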


>>: Are these phrases representative generally as just words, or do you have some sort of semantic model to work from?


>> Emily Hill: Thank you. That is exactly where we are going. We started off with strict phrases, and then we recognized the potential: if we could build that general model, any software engineering tool could use it, so that is our ultimate goal. And that is actually our next section, so almost there. [laughter]. And just to give you a sense of how we generate the phrases, because this was kind of our starting point for building the model: for fields and constructors we naively assumed that they were all noun phrases, that they didn't involve actions, so FileWriter, ReportDisplayPanel; and then we assumed that method names were verb phrases, that they started with a verb and had an optional object after them. So verb phrases consist of a verb followed by a direct object, and then an optional preposition and an indirect object, and if you have forgotten your grammar, before I did this research I didn't remember the difference either. If we take an example phrase like add item to list, add is the verb, to is a preposition, item is the direct object, and list is the indirect object. So we always look for a verb and a direct object, and if there is a preposition in the method name, then we go hunting for the indirect object. Our real challenge was identifying the direct and indirect objects of the verb, and we typically look first in the name; that is obviously the best indicator. If it wasn't there, we looked at the first formal parameter and then at the class name.

So for example: get connection type, run iReport compiler, update event, or compare playlist file to object. Uh-huh?
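The verb phrase rule just described can be sketched as a small parser over a split method name. The preposition list and the example names are simplified assumptions for illustration, not the actual implementation.

```python
# Parse a method name into (verb, direct object, preposition,
# indirect object), assuming the name starts with a verb.

PREPOSITIONS = {"to", "from", "with", "for", "in", "on"}

def parse_method_name(words):
    """Split a method name's words into verb phrase roles."""
    verb = words[0]
    rest = words[1:]
    prep = next((w for w in rest if w in PREPOSITIONS), None)
    if prep:
        i = rest.index(prep)
        # Direct object before the preposition, indirect object after.
        return verb, " ".join(rest[:i]), prep, " ".join(rest[i + 1:])
    return verb, " ".join(rest), None, None

print(parse_method_name(["add", "item", "to", "list"]))
# -> ('add', 'item', 'to', 'list')
print(parse_method_name(["update", "event"]))
# -> ('update', 'event', None, None)
```

Falling back to the first formal parameter or the class name when the direct object is missing from the name, as described above, would be an extra lookup step layered on top of this.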


>>: Did you find any computer science or programmer-specific idioms that you needed to also heavily mine? I see a lot of code that says x, digit 2, y for transformations from x to y.


>> Emily Hill: Yeah, so we mostly avoided to, although I did have a version of an identifier splitter that preprocessed to and tried to use it to mean convert. So we did some work with idioms like convert, especially if it starts with the preposition to: to string, that's converting something to a string.

>>: What about the digit 2?


>> Emily Hill: No, I know. If I handled it, it was during the identifier splitting phase, where I could try to detect that, but in general if it started with a to preposition, we might infer convert, and there were a couple of cases, but again, it is all about how much time you want to spend doing that. And for query reformulation, just generating these phrases, we didn't really need that level of detail; it still worked pretty well. But as we go to the more general model, we have to spend more and more time making it more accurate and doing that parsing. And I will show you how we go about doing that in general.
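The splitting-phase idiom mentioned here, reading a lone digit 2 between words as to (as in str2int), can be sketched like this. It is an illustrative heuristic, not the exact rule used in the splitter.

```python
# Split camelCase identifiers and expand a lone "2" to "to".
import re

def split_with_idioms(identifier):
    """Split an identifier into lowercase words, treating a lone
    digit 2 as the word "to" (str2int -> str to int)."""
    parts = re.findall(r"[A-Za-z][a-z]*|[0-9]+", identifier)
    return ["to" if p == "2" else p.lower() for p in parts]

print(split_with_idioms("str2int"))   # -> ['str', 'to', 'int']
print(split_with_idioms("xDigit2Y"))  # -> ['x', 'digit', 'to', 'y']
```

Multi-digit runs like 16 or 32 are left alone, so only the lone 2 gets the idiom treatment.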


So, to evaluate: I called our query reformulation technique contextual search, because it uses the context of the query words, and to evaluate it we compared it with an existing technique called verb-direct object, which is very similar to our technique except that it uses only the verb and the direct object; it doesn't consider any general noun phrase method names or prepositional phrases. We compared search results from 22 developers on 28 maintenance tasks; they were searching for 28 concerns, or search tasks. Here we have box plots for the comparison between contextual search, which I call context, and Verb DO on the right. In the box plots, the middle shaded box represents the middle 50% of the data, the horizontal line is the median, the plus is the mean, and the Xs represent outliers. We compared these two techniques in terms of effort, which we measured using the number of queries the user entered. Ideally, we would've liked to have measured effort in terms of time, but we didn't want to tell our subjects that they were being timed, and some of them ate during one half of the experiment but not the other, so unfortunately all we have is the number of queries. We also compared effectiveness, using the common information retrieval F measure, which combines precision and recall. And we can see that contextual search requires less effort than Verb DO and returns more effective results. Because contextual search significantly outperforms Verb DO, it justifies going down this path: the more accurate we make our information, instead of just stopping with verbs and direct objects, and really trying to model noun phrases and prepositional phrases, the more significant the improvements we can actually get. Is that a hand?
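The F measure mentioned here is the standard harmonic mean of precision and recall from information retrieval; the counts below are invented for illustration (they echo the earlier DoAction example, not the study's data).

```python
# F measure: harmonic mean of precision and recall.

def f_measure(relevant_retrieved, retrieved, relevant):
    """Combine precision (fraction of retrieved results that are
    relevant) and recall (fraction of relevant results retrieved)."""
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / relevant
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# E.g. a search returning 3 results, 2 of them relevant, out of 2
# relevant methods total: precision 2/3, recall 1.
print(round(f_measure(2, 3, 2), 2))  # -> 0.8
```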


>>: The measurements, did the comparisons hold true for the 10 subjects, for every subject?
subject?


>> Emily Hill: Yes, because that is how we ran it. We did it paired; we ran both ways. We did the two-sample t-test as well as the paired, because it was kind of a mixed model result, but yes, it held for both of them. Although, a lot of the subjects liked what Verb DO did that contextual search didn't, which was that it also did co-occurring pairs. Your query had to be a verb followed by a direct object, but if you entered a verb, it would list all of the other co-occurring direct objects, and if you entered a direct object, it would list all other co-occurring verbs. The subjects did like seeing what other words co-occurred with their query words, so they did really like that. But it was so limited, because it only matched using verb and direct object; there were some search tasks that they could not formulate queries for, and that is partly what led to its performance. A combination ultimately would be ideal, and we are actually still working on trying to take that to the next level. So any other questions about that before I move on?


So as I mentioned before, we started getting inspired by these phrases and thinking, gosh, what else could we do with them? Another student at the time actually wanted to work on automatically generating comments, and we thought, if we could really turn these phrases into a generalized model of the semantics of the program structure and the natural language in the underlying source code, it could be used in almost any software engineering tool that uses textual information. And so the challenge was, well, how do we go from phrases to a generalized model that more people can take advantage of?


So with query reformulation, our phrases capture noun phrase and verb phrase phrasal concepts for methods and fields: for example, convert result, load history, synchronized list. But we needed to generalize that from a textual representation with phrases to a model of this phrasal structure, so the phrases could be annotated with their different roles in the natural language. And we also needed to improve the accuracy. For example, if I am going from a field signature to a phrase, I could actually mistakenly label a verb as a noun, and the phrase would still come out readable and correct. But when we want to internally represent it as a phrasal concept, we have to have a lot higher accuracy. So our goal is to represent the conceptual knowledge of the programmer, as expressed in both the program structure and the natural language, through these phrasal concepts. We are trying to provide a generalized model that can be used in automated tools to represent or encode what a human sees when they read code. That's our goal, where we are trying to get to.


So this is an overview of our software word usage model, which I will call SWUM, and it consists of three layers. The top layer is the program model; any program analysis, any program structure you have used before would fall into that layer: ASTs, call graphs, type hierarchies. That's the traditional analysis layer. At the bottom there is a word layer, each word individually, and that is what has typically been used by textual analysis techniques in the past, the so-called bag of words model. Our real insight, our contribution, is the interior, middle layer, the SWUM core, which models the phrasal concepts, and that is where we do the parsing of the words into verb phrases and noun phrases and start annotating them with action and theme. Now, at this level I am switching to the words action and theme, from verb and direct object, because verb and direct object are syntactic-layer information, whereas action and theme are more semantic, higher-level concepts. But for all intents and purposes, you can think of them as verbs and direct objects and you won't be far off.


So we have three different types of nodes, one for each layer: program element nodes, word nodes, and then phrase structure nodes, which represent the phrasal concepts.

In terms of edges, within each layer we have edges. At the top we have structural edges; in the middle we have parse edges; at the bottom we have word edges, so we can represent things like synonyms or stems. If you want to know that adding is the same as add, you can put that kind of word relationship in the bottom layer. And in between the layers we have the bridge edges, which allow us to go from the program structure to the phrase structure, so you can navigate and take advantage of all of the information of the AST and call graph, as well as all of the semantic information in the parses and the phrasal concepts.
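The three layers and the edges between them might be sketched as a toy graph like the one below. The node contents and edge names are invented for illustration; the real SWUM model is much richer than this.

```python
# Toy three-layer SWUM-like graph: program-element nodes, phrase
# structure nodes, and word nodes, connected by bridge/parse edges.

class Node:
    def __init__(self, kind, label):
        self.kind = kind        # "program", "phrase", or "word"
        self.label = label
        self.edges = []         # (edge_kind, target) pairs

    def link(self, edge_kind, target):
        self.edges.append((edge_kind, target))

# Program layer: a method element (structural edges would live here).
method = Node("program", "addItem(Entry)")
# Phrase layer: its phrasal concept, annotated with semantic roles.
phrase = Node("phrase", {"action": "add", "theme": "item entry"})
# Word layer: individual words (the bag-of-words level).
add_w, item_w = Node("word", "add"), Node("word", "item")

method.link("bridge", phrase)   # bridge edge: program -> phrase structure
phrase.link("parse", add_w)     # parse edges: phrase -> its words
phrase.link("parse", item_w)

# A tool can start from the program structure and reach the semantics:
concept = next(t for k, t in method.edges if k == "bridge")
print(concept.label["action"], "/", concept.label["theme"])
```

The point of the bridge edges is exactly this navigation: a tool holding a program element can reach the action/theme annotations without knowing anything about how the parsing was done.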


So we are really trying to provide an integrated solution, so that tool developers don't have to understand all of the parsing details but can still leverage textual information in their software engineering tools. Our goal is that if we had a model like this, we could provide an interface between people who want to use textual information and people who are working on improving the accuracy of the parsing layer, similar to how the PDG became an interface for researchers and developers using program analyses. So that is our ultimate goal. It might not be SWUM; it could be something similar, but that is what we are working towards.


So what are some of the challenges in automatically constructing such a model? Well, first we have to accurately identify the part of speech. This is a well understood problem for natural language, but in the sub-domain of software it becomes even more challenging. For example, the same word might have multiple parts of speech, and actually I really like the example fire, because in natural language it is typically a noun: you see fire and you try to put it out. But in source code, fire is often a verb; it can be a noun modifier, like an adjective, or it can be a noun if it is in a gaming system. And so for every word in an identifier, we have to somehow identify some kind of part of speech for it if we want to accurately parse the identifier names. So our approach is to use both the position of the word in the identifier and its location. Is it in a field, is it in a method, is it in a constructor? That helps us try to disambiguate what part of speech that word is.
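Position- and location-based part-of-speech guessing might look roughly like this. The rules and the verb list below are simplified stand-ins for illustration, not the actual SWUM construction rules.

```python
# Guess a part of speech for an identifier word from where it sits:
# its position in the split identifier and the kind of declaration.

COMMON_VERBS = {"fire", "add", "get", "set", "load", "run", "map"}

def guess_pos(word, position, location):
    """position: index in the split identifier (-1 = last word).
    location: 'method', 'field', or 'constructor'."""
    # Fields and constructors are assumed to be noun phrases:
    # the last word is the head noun, earlier words modify it.
    if location in ("field", "constructor"):
        return "noun" if position == -1 else "noun-modifier"
    # In a method name, a leading known verb is read as the action.
    if position == 0 and word in COMMON_VERBS:
        return "verb"
    return "noun" if position == -1 else "noun-modifier"

print(guess_pos("fire", 0, "method"))   # fireEvent: fire acts as a verb
print(guess_pos("fire", 0, "field"))    # fireAlarm field: a modifier
print(guess_pos("fire", -1, "field"))   # forestFire: the head noun
```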


After we have identified the parts of speech, we parse them by identifying the action, theme and secondary arguments for any method verb phrases that we have. Noun phrases are very simple: we don't go beyond noun modifiers and nouns, so we don't differentiate between adjectives or nouns that have become adjectives or things like that. But verb phrases, and identifying their themes and secondary arguments, that is where the challenge is. For example, we have a reactive method, actionPerformed, which doesn't tell us much about what the method is doing, so we don't have a very good solution for that yet. Then there are names like handleActionPerformed, tearDownSetGroupsTest, convertRestrictionToMinimumCardinality, or addAuctionEntry. In phrase generation we just generated all of the phrases: we would generate "add entry," "add auction entry," we would just generate them all. But in building this model we tried to take a step back so we can present and preserve as much information as possible for the end tool, because we don't know exactly what that tool is going to be. So now we say the action is "add" and there are two themes, "entry" and "auction entry," and those are equivalent; they describe the same thing. So we would figure out: if there is a direct object in the name, does it overlap a parameter? Do the head words, the last words to the right of the phrase, overlap? And so we would identify that those are equivalent. Uh-huh? Do you want me to go back?

Ask at the end? [laughter] Okay.
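The action/theme extraction and the parameter-overlap check just described could be sketched like this; the names and rules are hypothetical simplifications, not the actual SWUM implementation:

```python
def parse_verb_phrase(method_words, param_names):
    """Treat the first word as the action and the remainder as the theme.

    If the theme's head word (its rightmost word) also appears in a
    parameter name, record the parameter as an equivalent theme, as with
    add(entry) for a method named addAuctionEntry.
    """
    action, theme_words = method_words[0], method_words[1:]
    themes = [" ".join(theme_words)] if theme_words else []
    head = theme_words[-1] if theme_words else None
    for param in param_names:
        if head is not None and head in param.lower().split("_") + [param.lower()]:
            themes.append(param)  # the parameter describes the same thing
    return {"action": action, "themes": themes}

print(parse_verb_phrase(["add", "auction", "entry"], ["entry"]))
# {'action': 'add', 'themes': ['auction entry', 'entry']}
```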


How do we go about developing these SWUM construction rules? Our research process is to analyze how developers actually use words in code. The concept behind any machine learning or natural language technique is that if a human can recognize it, we can train some automatic tool to recognize it. But you have to be careful of cost-benefit analysis: sure, I can recognize anything a human can, but how long is it going to take me to develop those rules? So we have been highly motivated by our target software engineering applications. Query reformulation required the least analysis, and it still works really well; we generated really readable phrases with less accurate rules. For search I didn't need to be quite as accurate as we needed to be for comment generation. When we are actually generating text for human consumption that summarizes a method, we had to be even more accurate. So we have been refining our rule identification process to be more and more accurate with each iteration, with each new tool we are targeting.


So I started with 9,000 open source Java programs, because they are available; that is what I had on hand. We start with those identifier names and try to classify each name into a partition. The easiest way is to classify them into method names and field names. Then I analyze each partition and evaluate the accuracy of our current approach on a random subset. For example, we could start by assuming that every method name starts with a verb; in fact, that is where we started with phrase generation for query reformulation, assuming every method name starts with a verb. Then we look at our random subset and we can see that that is true for the first three methods, but size and length are actually getters with noun-phrase beginnings, toString and next start with prepositions, and synchronizedList actually starts with an adjective. So our next challenge is to refine our approach and our classification. First we need to find which partitions are missing; that is usually the easy part. But then we have to figure out how to automatically identify and categorize these method signatures into those partitions. And we continue repeating this process on a random sample until we are happy with the level of accuracy for our target software engineering application. So as we keep evolving this representation over time, we are working to improve the accuracy more and more.


So we have this model, but how expensive is it? [laughter] Is it going to scale to really, really big software? In terms of space, if you build the entire model, it contains a node for every identifier and every unique word that is used, and the number of edges is linear with respect to the number of words within those identifiers and whatever structure or word information is included in the model. So that may be very dependent on your target software engineering application, based on how much program structure information you need: do you just need the AST, or do you need more than that? In terms of time, it can be built incrementally and constructed on demand, so that helps limit the costs. I created an unoptimized research prototype, and to give you a sense for how long that took, I analyzed signatures for a 74,000-line-of-code program in 11 seconds, and 1.5 million lines of code in 11 minutes. So we consider that to be reasonable for most of the code that we are looking at, but I don't think it is quite as large as what you guys might be looking at. [laughter] So that would definitely be something to consider.


And there are some optimizations that can be done. First, you can optimize by the level of program structure and accuracy that you need. For example, for query reformulation I didn't need the level of accuracy that I needed for searching, so some optimizations can be made that way. The model can also be constructed once and used in many software engineering tools. So if you wanted to commit to this kind of representation for a wide variety of software engineering tools, it would make more sense to use the expensive analysis, because you would get to reuse it over and over again across those tools. And because it can be built incrementally, it can be updated incrementally overnight, so you just have the one big up-front batch cost and then you can incrementally update it as the code evolves.


So what other software engineering tools can it be used in? So far we have applied it to source code search, also known as concern location. In program comprehension and development, we have applied it to automatically generating comments to summarize what a method is doing. It could also be used for automatic documentation of program changes, automatic recommendation of API methods, or a novice programming tutor; anywhere you could use text to help solve a software engineering problem, you can take advantage of this kind of analysis. In terms of traceability, linking software artifacts together means linking external documentation, e-mails, and bug reports to the source code. That involves getting a representation similar to SWUM for those natural language artifacts. In theory that is the easier problem, because analysis tools exist for natural language text in general, although they would probably have to be tweaked for certain types of software artifacts.

We can also work on building more intuitive natural language based interfaces. For example, in the Whyline interface by Ko and Myers, the questions users could ask about the program execution were pre-canned, preprogrammed in; we might be able to allow the user to ask more informative questions that they initiate, rather than just choosing from a list of questions, possibly. There is also mining of software repositories: for example, we can use this kind of representation to automatically build a WordNet for software synonyms by looking at verbs that are in the method signature as well as in the body. And we can continue improving our SWUM construction rules, so we can use SWUM to help improve SWUM in the future and make it more accurate. But anywhere you could use text to solve a software engineering problem, that is really where this could be used, as long as it is worth it, as long as it is adding something: adding value, adding accuracy. So, any questions about the general model before I show how it improves something like search? Yes?


>>: When you were trying to distinguish between add entry and whether it is an [inaudible] entry or just add entry, have you considered also looking at the call sites to see what the variable is, the variable name of the thing that got passed in as the argument to that method? So another name for that [inaudible].


>> Emily Hill: Yes, I was just demonstrating the signature-level analysis, but when we actually analyze a method call, we take into account both the formal and the actual, and the type of the variable; we have four sources of information, the variable's name and type for both the actual and the formal. And we may have an additional source of information if the method call as a whole is nested inside another method: the formal parameter of whatever it is a parameter for is also summary related. So yes, when we get to the method body analysis, we do chain those all together, to extract every last drop of information we can.
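The four sources of naming information at a call site that she lists could be collected in a structure like this; the field names are illustrative assumptions, not the actual SWUM code:

```python
from dataclasses import dataclass

@dataclass
class CallSiteWords:
    """Name and type words for both the actual argument and the formal parameter."""
    actual_name: str
    actual_type: str
    formal_name: str
    formal_type: str

    def all_words(self):
        """Pool every word from all four sources for the textual analysis."""
        words = []
        for source in (self.actual_name, self.actual_type,
                       self.formal_name, self.formal_type):
            words.extend(source.lower().split())
        return words

site = CallSiteWords("fatal error", "string", "error", "string")
print(site.all_words())  # ['fatal', 'error', 'string', 'error', 'string']
```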


>>: I guess two questions. It seems like this is specific to the natural language being used. I suspect that a large majority of code uses English identifiers and [inaudible], but how difficult would it be if you were working on a German code base, or Chinese, or whatever? Do you have any notion of how prevalent that is? I mean, do we see open source code that is written in a different language?


>> Emily Hill: Yeah, the 9,000 programs contain German and French and Spanish and Italian. Not a lot, but it's there; it is clearly there. [laughter]


>>: [inaudible] change your technique [inaudible] different languages where the structure could be different?


>> Emily Hill: If they structure their identifiers differently. The challenge is: if they are used to writing English and they just start writing in another language, they might actually still follow English naming convention patterns, just with different words. That is really simple to address. But if they are actually changing the structure of how they name things (Germans can have a different phrase structure than English does), and they don't start their method names with verbs anymore, then you have to develop a completely new part-of-speech analysis for that. So it is challenging if it's not just substitution. If things are still in the same positions and they follow similar naming conventions, just with different words, that's just a new dictionary; that is easy. But if they actually reorder it…


>>: So this would be like an off-the-shelf classifier, like what is a noun, what is a verb…


>> Emily Hill: Right. And there are a lot of them that exist for other natural languages; it's just a matter of tailoring them. The same or similar techniques to what we've used to specialize them for software would work there, but you need some sense of the naming conventions used. I think the big limitation of this is that it is based on naming conventions, and if you change those significantly, whether it's another natural language or another programming language, you're going to have to do a lot more work. This is mostly done in Java; if you're going to other object oriented languages, like C++, there are many similarities, but you have to reverify them, make sure that they are still following the same naming conventions, and that would apply whether you are looking at a natural or a programming language change. Uh-huh?


>>: Do you find that this information is just not very useful? Like, names poorly chosen?

>> Emily Hill: So for scientific software, all bets are off: predominantly highly parallel codes, scientific codes where the variable names are all x, y, z, a, b, c, this is not going to work well. We know that. It is kind of a subdomain that we are analyzing separately, because it has separate challenges. So we predominantly looked at open source codes, typically GUI applications. They have user interfaces; they have features that are typically well named, because they are open source and they have to use the source code as a communication mechanism between the developers. Other places where it doesn't work well are what we call reactive method names, like API method names. If you are overriding an interface, you didn't get any choice in selecting that method name. So we have to really rely on the method bodies to build the semantic model, or to generate the summaries for comment generation, for example. But as long as you have implemented some meaningful words inside that API method, then we can still use it.


>>: But you are saying that you also do look at the program structure within a function, the actual statements…


>> Emily Hill: Yeah, depending on which problem we are solving. For search I haven't gone to that level, because it's too expensive, but for comment generation we have to, because we are trying to generate a summary of a method automatically. But yes, we do have mechanisms for analyzing and trying to summarize those sub-statements: [inaudible] analysis for loops, for if statements, for blocks of statements, to summarize what they are doing and generally summarize that action. And the same concepts can be used to automatically debug method names by looking at what the inside is. Does it match what the method name itself says, like a setter that doesn't set anything? That is an example of things that we can attack using this mechanism. Any other questions before I move on?


I ran out of water. So now, the target application where I have mostly been interested in using this model is to improve search. Can we make search more accurate for software? Really I am most concerned with improving the precision, and that is where the phrasal concepts come in. So this is a specific example of SWUM, to give you a better understanding of how we are using it. In the top left I have a very small snippet of code, from MainObject.java. The method is called handleFatalError and it has one line of code, sysLogger.doPrint, and it is printing an error. The program structure representation of that method call in the body is: the method doPrint is invoked on the expression sysLogger, with an actual parameter of error, and that maps to the phrase structure all the way to the right. I have gone ahead and put the word nodes right into the phrase structure layer. That is usually how I think about it, but technically these can be three separate layers, and that helps with the optimization. But for readability I have put them all up here. The gray nodes are the phrase structure nodes: we have the verb phrases, a prepositional phrase and a noun phrase. The white nodes are the word nodes.


For search, what we use are these different semantic roles. We have an action, do print. We have a theme, or a direct object, error. Our secondary argument is "to sys logger"; in this case we have inferred the preposition "to," and we have some rules to do that, but it is not general, there are just some specific ones that we can look for. And we also have auxiliary arguments if we have additional formal parameters. So for example, error is our theme; we might find that it is equivalent to the error in the formal parameter. So we can have additional auxiliary arguments, especially if there is a whole list of additional formal parameters; any of them that is not Boolean is usually added to the auxiliary argument list, unless the name starts with a verb that we know typically has Boolean arguments. But I am getting into low-level details there. The really important thing is that we have these different semantic roles: action, theme, secondary argument if there is some kind of preposition involved, and any remaining auxiliary arguments, so that we can put all of the information from the signature, all the information we can find, into one of the semantic roles, and we take that into account in calculating our relevance score.
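For the sysLogger.doPrint(error) example, the extracted roles and their use in scoring might look roughly like this. The weights and structure here are assumptions for illustration, not the published scoring function:

```python
# Semantic roles extracted from sysLogger.doPrint(error) in handleFatalError.
roles = {
    "action": "do print",                 # invoked method name doPrint
    "theme": "error",                     # direct object / actual parameter
    "secondary_args": ["to sys logger"],  # inferred preposition plus receiver
    "auxiliary_args": [],                 # remaining non-Boolean formals
}

def score_contribution(query_word, roles):
    """Matches in the action or theme count more than in other roles."""
    if query_word in roles["action"] or query_word in roles["theme"]:
        return 1.0
    if any(query_word in arg for arg in roles["secondary_args"]):
        return 0.5
    if any(query_word in arg for arg in roles["auxiliary_args"]):
        return 0.25
    return 0.0

print(score_contribution("error", roles))   # 1.0
print(score_contribution("logger", roles))  # 0.5
```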


We also take into account the head distance, which is the location within the phrase structure. In natural language phrases there is this concept that the word all the way to the right in the phrase, the last word in the phrase, is the head word, and it is really the theme of that phrase. For example, take the phrase sys logger: it is less about sys, or system, and more about logger, because logger is in the position of the head. So logger would be labeled as head, and sys would be labeled as one away from the head. We use that head distance because if a query word appears in the head position, that method or that phrase is more likely to be relevant to the query.


So, the different sources of information we use: as I just mentioned, we use the semantic role, and we assume that query word occurrences in the action and the theme are more relevant than occurrences in other argument roles. That is inspired by the verb-direct object approach that was used before. We also take into account the head distance, which is a new aspect that has not been involved in software search before: the closer the query word is to the head position, the more strongly the phrase relates to the query word. So for example, in our auction example, "special auction" has more to do with auction than "auction server" does, because an auction server is really about a server, which happens to hold auctions, whereas a special auction is actually an auction.
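The head-distance idea can be illustrated with a small weighting function; the decay factor here is an assumed value for the sketch, not the one used in the tool:

```python
def head_distance_weight(phrase_words, query_word, decay=0.5):
    """Weight a query word by its distance from the head (rightmost) word.

    The head word scores 1.0; each step to the left multiplies by decay,
    so any occurrence still counts, just less strongly.
    """
    if query_word not in phrase_words:
        return 0.0
    distance = len(phrase_words) - 1 - phrase_words.index(query_word)
    return decay ** distance

print(head_distance_weight(["special", "auction"], "auction"))  # 1.0 (head)
print(head_distance_weight(["auction", "server"], "auction"))   # 0.5 (one away)
```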


The idea is to be greedy, with diminishing head distance, so that as long as the word appears somewhere in the phrase, it comes up as relevant. We have chosen the scores so that occurrences in the head position hit first, and later down the list we will have other occurrences of the query words, just in case, to be greedy, if the query word never appears in the head position. So we try to do a best effort. An additional piece of information we use is the location: query words appearing in the signature, we believe, more strongly indicate relevance than appearances in the body. And as with traditional information retrieval techniques, we use inverse document frequency to approximate usage in the rest of the program: frequently occurring words throughout the entire program typically aren't good discriminators, so we inversely weight their contribution to the score using IDF. How's that? Okay?
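The IDF weighting she mentions is the standard information-retrieval formulation; here is a minimal sketch over methods treated as documents, with made-up method word sets:

```python
import math

def idf(word, method_word_sets):
    """log(N / df): words that occur in fewer methods get higher weight."""
    n = len(method_word_sets)
    df = sum(1 for words in method_word_sets if word in words)
    return math.log(n / df) if df else 0.0

# Hypothetical program with four methods' word sets.
methods = [
    {"add", "auction", "entry"},
    {"remove", "auction", "entry"},
    {"get", "size"},
    {"print", "error"},
]
print(round(idf("get", methods), 3))      # 1.386 (in 1 of 4 methods)
print(round(idf("auction", methods), 3))  # 0.693 (in 2 of 4 methods)
```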


>>: In that IDF, do you segment the difference between the signature and the body? Because if you have print, it is going to occur frequently [inaudible] lots of bodies, but as a method signature, there is only one.


>> Emily Hill: We segment it just based on identifier splitting and whether or not we are using stemming. So we just split all the words and we use that as the IDF. We haven't done a location-based IDF, although that would be an interesting thing to try. The problem is that we don't know what the user is searching for: do they want just the signatures or not? So the challenge is figuring out how the user specifies that. If they know that they are looking in a certain role, and they have that information, certainly we can take advantage of it. But I think that is a challenge, and why we haven't done it yet. Any more? Okay.


>>: So this is great, but it is very different from a browser search. If users are used to doing it one way, how can you wake them up and say, hey, we do things differently but it's better? Have you thought about that?

>> Emily Hill: Well, the idea is that we want to make the query mechanism as simple as possible. We want the query to be a short two to three word phrase, the same way you would search on the internet. That is our goal, and that is why we are jumping through all these hoops to try to make a short query effective, because really the search problems are very different. When you are searching the web, you probably have a question, and as soon as you get one webpage that answers your question, you are done. But when I am searching code for maintenance purposes, I need every relevant occurrence. I am not satisfied with just one relevant result; I need all of the relevant results. That is why we are working so hard to really try to get precise, and then we bring in program exploration techniques to improve the recall. Right now we are searching over so many different methods: how can we find the ones that are the most relevant to the query, and then can we refine those further to improve the recall? That is kind of our approach. Uh-huh?


>>: It seems like you're operating with the constraint that a query is a sequence of words. By providing some summary, you are allowing them to say, oh, I am looking for a signature, or I am looking for something. But couldn't you, rather than displaying everything so they can filter, allow them to filter preemptively? When you query, instead of providing just words, also say here are some things I care about, like I only care about methods, or I only care about a class; providing some additional information in the query instead of trying to provide it in the summary later on. Does that make sense?


>> Emily Hill: Definitely. The more information they can give us, the better; we just don't want to enforce that. We want to allow the ability -- the holy grail for me has been that I should be able to search my source code as easily as I search the web with Google or Bing. But as we refine this and try to better meet developer needs, I think we are going to find that we have to add things like that. So far we are just trying to make a general solution: how far can we push it? How accurate can we get? But it is really hard to make a general solution that works well, because there are so many different types of information needs and so many different reasons a developer might be searching. It is hard to be all things to everyone, so I think our next steps are further specializing. Yes?


>>: [inaudible] searching, do you frequently have this [inaudible] page optimization [inaudible]? If you would change [inaudible] identifiers [inaudible], how would you change them? Like, what would make it easier for your approach?


>> Emily Hill: Oh, right. Based on the rules that we have learned, we can provide guidelines to developers: if you write your code and follow these patterns, we are going to be better able to find it, definitely. What we have tried to do is use naming conventions and patterns that developers use over a wide variety of source code, but especially if there are company-mandated naming conventions and you follow those, we can improve the rules and the accuracy a lot. So definitely, if developers can provide that information, it would definitely help us improve our accuracy. Although we have made our problem harder by assuming that we don't have that luxury and trying to still be successful: how far can we push it? How accurate can we get? I really think that the accuracy is still only around 70% F measure, because there is a limitation to using the words alone; sometimes there are going to be methods that just don't contain any relevant words, and that is a challenge. There is like a bar, and we are just trying to see: can we reach that bar, and then how do we keep going beyond it? Uh-huh?


>>: [inaudible] methods and relevant words, what do you do for abbreviations?


>> Emily Hill: I have a technique for abbreviation expansion, but it is not quite accurate enough yet, so I haven't thrown it in here. That is partly why we have pushed the query reformulation technique, so the developer can more quickly explore how something is actually implemented; if they wanted to use both the abbreviation and the full form, they could add that in, by seeing what the words are used for. Right now we are not taking that into account. There is certainly more room for synonyms, abbreviations, all of those things, but right now we are just strictly going off the words themselves. Uh-huh?


>>: Is there any way that you could leverage developers to help you with this task? If you know, here are my blind spots, here are the methods that I just can't reason about, could you say, okay, you get an hour of a developer's time to annotate? Like, I don't know these abbreviations, I can't expand them, or something like that. Have you thought about that? Because people aren't going to annotate everything, but sometimes you can use people's time really effectively and there is a payoff later…


>> Emily Hill: No, definitely, we haven't really thought about that, but that is a really good idea, if we could get developers to do that. A lot of this, unfortunately, we do ourselves, and so we are relying on our analysis.


>>: [inaudible] warning, like, this isn't very well named.


>> Emily Hill: Exactly.


>>: If you know [inaudible] that's all right. I have seen where it actually says this is
named badly, fix this.


>> Emily Hill: Exactly. And yeah, if we could integrate that idea and collect that information, then we could really help improve our tools, definitely. Any information is helpful.


>>: So one additional source of information that I know has been crucial for web search is the notion of a static rank for a page, like, what is the prior relevance of this piece of information. It feels like you could incorporate that same sort of information here: maybe if a piece of code has a lot of callers or callees, it is sort of a [inaudible] authority in the call graph [inaudible] greater relevance; if more execution time is spent inside that piece of code, maybe it is more important; maybe being closer to the main function makes it more important. It feels like there is a bunch of prior signals about the relevance of a piece of code that could not only be used to help relevance, but also to identify where you get the most bang for the buck if you're going to ask your developers to spend a little more time on things. Have you put any time into this prior?


>> Emily Hill: No, we haven't used any relevance feedback yet, although there are some techniques that have, using the hub-and-authority type of mechanism, although it was counterintuitive and they had to actually turn it around: the hubs were not the places you wanted to go, because they were so interconnected, which means they are so general they are not useful. But they have taken that into account. We focus purely on how much we can get from the structure and the words, but actually adding in some kind of hub and authority would really be helpful, I think, if we could use it accurately. Because obviously getters and setters, low-level methods, we don't want those. We probably don't want ones that are too high either; you kind of want ones that are in the middle, and I think you could use call graph information to help, definitely. We haven't gone that step yet, but definitely we could totally -- any information you've got, we could put into it and further increase the accuracy. I have just been focused on how far we can push the words themselves, and then once we get there and figure out what that barrier is, keep going. So I see another hand.


>>: What about presenting the search results in a more graphical, structural way? Maybe as you build up this model of all these functions you have sort of a functional model of the whole app, and it would be interesting to view the search results in the context of a graph, or a call graph.


>> Emily Hill: Definitely. In fact, I personally really like seeing results in a call graph format, and that is part of the reason why we have worked towards integrating search and exploration, because that allows us to present it in a more graphical way and you just get more context. That is my personal feeling; I don't know what developers in general want to see, and we would have to undertake a study to see how people want to see it. In an informal study of a handful of developers, we found that depending on what they were using it for, they really wanted a map where they could zoom in and out. So, presenting the results in a format where they could zoom out and get more context, or zoom in, which I think you guys have done work on [laughter]. But we have not actually gone that far yet. We are working on whether we can automatically restrict that graph so that we are not overwhelming them with information, using these search and exploration tools. As for how the results are represented, so far all we have really contributed there is query reformulation and that phrase hierarchy, but that is definitely not where we want to stay. We want to keep evolving it, but we need to study what developers really want to see first, unless we can leverage what some other people have studied [laughter]. Other questions on this?


Okay. I can show you some results of what we have done. We evaluated our SWUM-based search technique against some existing search techniques. There is ELex, which is Eclipse's regular expression search; it is similar to grep. We also used Google Desktop Search, which has been integrated into Eclipse; that is called GES. And then we also have FindConcept, which is really where we started from; that was the inspiration for our approach. It is similar to the verb-DO approach that we used before, except that it also uses synonyms in the query reformulation. So FindConcept, given a verb-direct object query, searches for verb-DO pairs in comments and method signatures, and allows the user to do query reformulations with synonyms and co-occurring verbs and direct objects.


And SWUM-T has a similar interface to Google Desktop Search, because we are using a similar query mechanism, and relevance is determined by our SWUM score exceeding some threshold, which we dynamically determine based on the average of the top 20 results. For search tasks, we used eight concerns from a previous study, which had 60 relevant methods; we were searching for them across 10,000 irrelevant ones in four different programs. In terms of queries, we used the top performing queries based on a prior evaluation. We did not want to compare how well users could use these search tools; we wanted to see, when a user was really able to get a good query in terms of precision, recall or F measure, when the techniques were most effective, and compare them under those ideal situations. The measures we used were precision, recall and F measure, commonly used in information retrieval.
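As a reminder of how those standard measures relate, here is a small sketch with hypothetical search results (the method names are made up for illustration):

```python
def precision_recall_f(retrieved, relevant):
    """Precision, recall, and the balanced F measure for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical query: 4 methods returned, 2 truly relevant out of 4 overall.
p, r, f = precision_recall_f(["m1", "m2", "m3", "m4"], ["m1", "m2", "m5", "m6"])
print(p, r, f)  # 0.5 0.5 0.5
```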


So what does it look like? Here we have a box plot of the F measure. Just as a quick reminder: the shaded middle region is the middle 50% of the data, the horizontal line in the middle is the median, and the plus is the mean. As we look from ELex to GES, FindConcept and SWUM-T all the way to the right, if we look at the height of the box for SWUM-T, we consider SWUM-T to be more consistently effective than the other techniques. It doesn't have the shortest box, but on the whole it has the smallest box that is also highest. When we analyzed recall and precision, we found that ELex, similar to grep, had good recall, but the precision was so poor that overall it inundated the developer with irrelevant results. In terms of precision, we found that SWUM-T and FindConcept were best, which means using phrasal concepts did improve our precision. In terms of recall, GES, which was the Google equivalent, and SWUM-T were the best. So the advantage of SWUM over our prior competitor FindConcept was that it had just as good precision, but it slightly improved the recall, because it is using a more general representation of phrasal concepts, not just verb-direct objects anymore.


So this was really a more preliminary study, and we would like to do a more widespread study to help flesh out these results, because these results are not statistically significant; we were using a small number of queries, just the best in terms of precision, recall, and F-measure. So we want to do a more general, broader study to further evaluate this. So, slightly switching gears just for a second… Yes?


>>: Do you have an example of what kind of query the users were initializing on these?


>> Emily Hill: Each type of search is going to give a different type of query. So ELex is
going to be a regular expression query.


>>: [inaudible] type of regular expression [inaudible] or something like that.


>> Emily Hill: And the users were allowed to interact with the tool until they were
satisfied and…


>>: And so they were given, like, here is what you're searching for. Now implement it using that.


>> Emily Hill: Yes. Good question, thank you. So GES and SWUMT were the same keyword queries. FindConcept had a specific verb followed by a specific direct object. And they could look at the search results and stop when they were satisfied, and the last query was the one that we used.


>>: And the sort of things they were searching for were like, find me a method that prints out logging information or something?


>> Emily Hill: Well, it was more feature oriented. So they may be shown a screenshot and told, find the code that implements this feature. They might be given a snippet of documentation and told, okay, find the code that implements this aspect of the system. So it was more feature based. Good question.


So I'm slightly switching gears, because after we have done this general search to find these seeds to start from, then we want to further refine that and explore the program further. But these are really two different problems with two different goals. In search we are trying to find seeds, whereas in exploration we are starting from these seed starting points. We've got these pegs in the code that we can start hanging on, and we are trying to build our understanding of the code around them locally. We are looking at relevant elements that are structurally connected to these seed starting points. So in search our goal is really high precision, because we are searching the entire code base and have this huge set of methods that we are trying to prune down, whereas in exploration we are trying to improve the recall further.


So our solution was to use phrasal concepts and SWUM to improve precision. And actually, even though I have complained about the bag-of-words approach for information retrieval, it is actually very good for high recall; it is very greedy. So when we are exploring, we actually argue for the bag of words. For our solution we created a tool called Dora the program explorer, which uses program structure and natural language as well as location, signature versus body. So in general, this is like the example I showed before. We used the frequency of the query words. So for example, I have the relevant method do add on the left and an irrelevant method delete comment on the right, and the relevant method had six occurrences of the query words while the irrelevant one had just two. And we weighted the contribution of the frequency based on the signature being more relevant than the body, and so we trained two weights using [inaudible] regression on a training example to calculate that score.
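A minimal sketch of this kind of weighted-frequency score follows; the weight values here are invented placeholders (the talk says the two weights were trained on examples), and the real Dora formula may differ:

```python
# Assumed weights: the signature is treated as more relevant than the body.
# In the actual work these two weights were learned, not hand-picked.
W_SIG, W_BODY = 0.8, 0.2

def dora_style_score(query_words, signature_words, body_words):
    """Count query-word occurrences in the signature and body separately,
    weight the signature hits more, and normalize by method length."""
    sig_hits = sum(signature_words.count(w) for w in query_words)
    body_hits = sum(body_words.count(w) for w in query_words)
    total = len(signature_words) + len(body_words)
    return (W_SIG * sig_hits + W_BODY * body_hits) / total if total else 0.0
```

For example, a method whose signature contains a query word scores higher than one mentioning it only in the body, matching the intuition in the talk.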


We also compared it to additional techniques. So we compared our Dora score, which was more advanced, to naïve approaches, AND and OR. AND returned true if all the query words were present; OR returned true, that something was relevant, if any one of the query words was present. And we also compared our technique to a purely structural approach called Suade and evaluated it on eight concerns mapped by three independent developers, which translated to 160 methods and over 1800 pages with overlap. And what we found was that using natural language and program structure together does outperform just using program structure, but you have to be careful how you integrate that natural language information. For example, if you just selected the AND naïve approach, you would be worse off than just using program structure alone. So how you combine the natural language information is very important. Success is highly dependent on the textual scoring performance, and our more advanced Dora did appear to outperform the other techniques.
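The two naïve baselines lend themselves to a direct sketch; the helper names are hypothetical, and the real tools operate on parsed identifiers rather than plain word lists:

```python
def and_relevant(query_words, method_words):
    """AND baseline: relevant only if every query word appears in the method."""
    return all(w in method_words for w in query_words)

def or_relevant(query_words, method_words):
    """OR baseline: relevant if any one query word appears in the method."""
    return any(w in method_words for w in query_words)
```

The AND predicate is the high-precision, low-recall extreme and the OR predicate the reverse, which is why, as noted above, plugging either one in naively can do worse than structure alone.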


So our real question, though, is if we take our highly precise search technique and a greedier exploration technique like Dora to improve recall, how much more of the concern can we get? How many more relevant results can we get for each search task? So what we did is we compared the three state-of-the-art search techniques with SWUM search plus Dora exploration. On the bottom we have ELex, which is like grep, GES, FindConcept, and then all the way to the right we have SWUM search plus Dora exploring one edge away. And if we look at the medians, we can actually see that the median results are significantly higher than for search alone. So right now we see that this is a promising direction to go in, that we can continue improving the results in general. If you're going to pick one solution, you are going to want to pick the solution that has the highest median, that is most effective most of the time. We are never going to have one silver bullet that is a perfect search all of the time, but search plus Dora does a better job in general than the other techniques.


And we also found that results can be further improved if we assume that there is a human pruning away the irrelevant search results before they go to the exploration phase. So in the first bar, search plus Dora, I took every search result in the top 10 and explored one edge away, and that was the accuracy. If we assume a human is pruning away some of those irrelevant ones, we get even better results. But again, the F-measure is still only at 60, because that is about the limit that words are going to get us, even with the program structure, even with Dora. So this is a preliminary result, and we found it very exciting. We also did some other studies and found that when we were searching using any base search technique, if we went 2 to 3 edges away from the starting seeds, we could get like 100% of the relevant results. So within 2 to 3 edges of the call graph you can get almost the entire concern, because programs are so highly interconnected, I believe, is the reason for that [laughter].
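The edge-limited exploration described here can be sketched as a bounded breadth-first traversal of the call graph; this is a simplified model with an assumed adjacency-dict representation, not the actual tool:

```python
from collections import deque

def explore(call_graph, seeds, k):
    """Expand a set of seed methods up to k edges along the call graph.
    call_graph maps a method name to the methods it is connected to
    (callers and callees alike, treated as undirected neighbors here)."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:  # stop expanding past the edge budget
            continue
        for nbr in call_graph.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen
```

With k of 2 or 3 on a densely connected graph, the reachable set quickly covers most of the program, which is the interconnectedness effect mentioned above.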


>>: Did you look at how many edges that required
you
looking at?


>> Emily Hill: Well, it grew…


>>: If you can reach like 20% or 30% of the
program from any point in three edges
then…


>> Emily Hill: No. The sweet spot was like returning the top seven results and then going and looking at the top five results, two edges away. We found that was the sweet spot. We got 80% of the correct results across the eight concerns that we were looking at using that. So it's possible that you can pick these thresholds and combine them in such a way that you can get a win. Because we found that to get every relevant result we needed to add two more results. So it grew exponentially, but there was a threshold where you are not overwhelming the developer and you're still returning more relevant results. But finding that... and it might be different from person to person as well, because different people want to look at different numbers of results.


>>: [inaudible] different programs, right?


>> Emily Hill: Exactly. Well, and it is highly dependent on the program itself. And actually, what makes this problem so challenging is that the query is really one of the most important determining factors in the success of the search, even more so than word choice in the program and the structure, because if the query is a bad query, it doesn't matter how good the search technique is; it is going to be a bad result. So it is a function of the query itself, the word distribution, the program, and the structure. That is why it's so hard to make a general solution.


So what is the research impact we've had so far doing this work? Navigation and exploration tools were typically manual and slow for large and scattered code. We added automated support that leverages natural language and program structure information, as well as location, to outperform competing state-of-the-art techniques. In terms of search tools, they typically return irrelevant results and miss relevant ones. We helped improve precision by capturing the semantics of word occurrences using these phrasal concepts in SWUM, as well as improving recall by combining search and exploration. But there is certainly more we could do along these lines. And so, just to summarize, the insights I tried to share with you today are combining natural language and program structure, taking advantage of word location, and using word context through phrasal concepts. I have talked about using that to improve query reformulation, software search, and program exploration. But there are tons of other software engineering applications where this could be used.


I am just one woman, and I haven't had time to try it out in all of these different places. So SWUM captures phrasal concepts, and our goal is that this can become an interface for software engineering tool designers and researchers to help improve linguistic analyses for software. That is our long-term goal in trying to develop this. In the future we are hoping to really explore the other ways that text, and specifically this SWUM model, can be used for other software engineering applications, to solve other software engineering problems, and to keep pushing the search further: to study what actual developers are searching for so we can further refine and better meet developer needs. Maybe they are not all just general purpose; maybe we need to start specializing. Okay. So that's it for me, unless you have more questions.


[applause].


>>: One more question. Have you thought at all about how one may change languages, or annotations that programmers can add, to improve this process? I am really more interested in the language, because anything you can do at the language level could make this easier or more accurate.


>> Emily Hill: Right now developers have tons of choices in choosing their identifiers, and I think that's great power, because they can be really flexible, but at the same time it makes it really hard. There are no standard naming conventions. How you are calling methods and naming the methods, if it were slightly more forced into a structure of verbs and direct objects, there would be a lot less ambiguity. If I knew that this is the action that is taking place and this is the object that it is working on, that would make it a lot clearer, I think, for what we are trying to do.


>>: Verifying the noun verb…


>> Emily Hill: Right, the action and the object.


>>: [inaudible] the names.


>> Emily Hill: Yes. And really, what we found in general is that actions and verbs in source code are typically used very interchangeably and synonymously, and that is actually the biggest source of issues. But the nouns tend to be pretty consistent, because they are typically objects that get one name and are used everywhere with that one fixed name. So it is an interesting blend of word choice and word restriction. It is way more restrictive than a random average natural language document, because you don't get all of these different forms of the words; once that identifier is fixed, everywhere else in the program has to use it that same exact way. But the actions, those typically aren't objects and typically don't get encapsulated, so there's a lot more word choice and variability. So anything from execute to fire, to do, we have so many synonyms for that one simple concept; compute, compare, there are a lot of different verbs that are used to mean the same thing.
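For illustration only, a tiny hand-built table like the following could normalize such synonymous action verbs to one canonical action when matching queries against method names; the verb groupings echo the examples in the talk, but the table itself is invented:

```python
# Invented synonym table: each surface verb maps to one canonical action.
CANONICAL_ACTION = {
    "execute": "run", "fire": "run", "do": "run", "run": "run",
    "compute": "calculate", "calculate": "calculate",
}

def normalize_verb(verb):
    """Map a verb to its canonical action; unknown verbs pass through."""
    return CANONICAL_ACTION.get(verb.lower(), verb.lower())
```

A real system would need a much larger, corpus-derived mapping, since verb usage varies by project and domain.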


>>: So it seems like it may not be actually changing the language but helping the developers. So, like, you could have an IDE that can give you choices about the [inaudible] you should be using at certain points, or there are words with the squiggles underneath…


>> Emily Hill: Yeah, like are you sure you mean this?


>>: Yeah.


>> Emily Hill: Yeah, we had a method like do print, or, what are the semantics of helping verbs, like can fire, can something fire, what does that mean for what that method is doing? And there is a lot that you can program in and learn from how it's used right now. For example, Høst et al. did work on a programmer's phrase book, where they analyzed the verbs and, when a verb is used, what that method structure typically looks like when that verb is used there, and so they can debug poorly named methods. So encoding that and building it into the IDE would really help us better leverage the text that's in there, because it would be more organized. The more ambiguity that you can take away, the better the results are going to be. Uh-huh?


>>: In the opposite direction, we try to preserve that ambiguity as much as possible in syntactic analysis, right? So that you don't just take -- I mean, I don't know to what extent you do this already. Are you taking just one best analysis, or do you have some packed forest representation of the natural language side, or?


>> Emily Hill: We try and preserve the original as much as possible.


>>: I mean, of course it explodes, right? But in practice -- so a lot of [inaudible] is in machine translation, right? When you throw syntax in, you can't explore all of the syntactic possibilities presented by one English sentence when you are translating into Japanese, but you can explore a highly likely subset, and if you have that ambiguity-preserving representation [inaudible] exponential combinations, we could get much better wins that way than just looking over the one best that syntactic [inaudible].


>> Emily Hill: Yes, so right now, in the model that I have shown you, it is just one best; I pick one way of doing it. But we had an undergrad who was working on using more advanced analysis and even more positional information, and she was looking at all of the different possibilities and then choosing between them using accuracies and things like that. So we have pushed it. It's not quite integrated, because every time you change the part-of-speech tagging, you have to change the parsing rule implementation, and so we are working on a very general way that that can be done, in a file or something, to make it really easy to change. But right now our challenge is how to design this interface in the system to make that really easy to change in the future. But definitely, the more you can take it -- we have tried to avoid presenting multiple possibilities other than something like an equivalence, like, okay, these two things are connected; we think they are the same. We have tried to avoid giving two parses, because they could have completely different semantic parses if you have two different syntactic parses. So we have tried to pick one, but maybe associate an accuracy with it. That is not implemented yet, but the goal would be, with each rule, to associate an accuracy for both the part-of-speech tagging and the semantic parsing.


>> Christian Bird: All right, cool.


>> Emily Hill: Thanks.