Human-Aided Computer Cognition for e-Discovery

Christopher Hogan
San Francisco, USA
Robert Bauer
San Francisco, USA
Dan Brassil
San Francisco, USA
Throughout the history of the field, AI researchers have alternately seen their mission as producing computer behavior that is indistinguishable from that of humans or as providing computational tools to augment human intelligence. The legal context provides a particularly rich domain for exploring the benefits of both approaches because human judgments are central to the delivery of just outcomes, with formal mechanisms for challenging any and all judgments within its core practices and institutions. Through exploration of legal e-discovery, we explicitly address these alternative paradigms for applying interdisciplinary sciences to knowledge-based systems. We demonstrate through a series of quantitative studies that a system architecture for Human-Aided Computer Cognition automates and replicates judgments of senior attorney relevance, as defined in legal practice, significantly better than the ubiquitous, traditional Computer-Aided Human Cognition.
Information Retrieval, Knowledge Representation, Knowledge Engineering, User Modeling
The legal imperative to disclose and/or discover all relevant material in unstructured electronically stored information is producing a burden that threatens the delivery of justice for all. Lawyers and their large institutional clients increasingly face the enormous problem of how to efficiently and efficaciously conduct searches for relevant documents in heterogeneous haystacks of electronic data. The heterogeneous complexity of datasets subject to discovery is rapidly approaching the threshold where hundreds of millions of documents are being made subject to more or less “routine” searches in a variety of litigation and investigatory contexts. E-discovery applications often involve massive datasets that couple formal documents both in character-coded and scanned formats, informal communications from a variety of systems, metadata in both standardized and nonstandard forms, and databases. Information access in these applications involves more than simply search; helping the searcher to make sense of what they find can be equally important. Protection for confidentiality can be particularly challenging in e-discovery applications, both because a complex set of interlocking rights and privileges must be accommodated and because sensitive information is often intermixed in ways that are difficult to segregate.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
12th International Conference on AI and Law 2009 Barcelona, Spain
Copyright 2009 ACM 1-60558-597-0/09/0006...$10.00.
This challenge is being viewed as a search problem followed by human review. As a consequence, search technologies in the marketplace are evolving to encompass what started as AI techniques some years ago. Today keyword search using Boolean expressions is incorporating techniques that represent contextual and semantic variations of the user query. The current methods receiving the most research attention involve various machine learning approaches. However, nearly all are applied in a paradigm where the resulting system augments a user’s retrieval effort with the final choice of relevance dependent on a human judgment. Such a system leverages technology to provide Computer-Assisted Human Cognition (CAHC). This reliance on the user as ultimate output arbiter fails to leverage the advantage of machines to replicate consistently a set of procedures for an unlimited number of times. In a litigation or regulatory context, the use of CAHC systems either causes indefensible processes to be employed or produces results that demonstrably fail to produce at least half the documents relevant to a legal matter. In this paper we describe a system architecture, application methods, and experiments that dramatically exceed all previously published assessments of CAHC retrieval
We demonstrate that achieving high Recall with high Precision can only occur when computer ‘cognition’ is employed. Obtaining machine simulation of human intelligence, the central challenge of AI, depends on the user ‘assisting’ rather than determining the search result. While the core technology may be similar, Human-Assisted Computer Cognition (HACC) better achieves a user’s search goals if the user intent can be modeled with a known accuracy. As shown schematically in Figure 1, building a HACC for information retrieval (IR) can be viewed as transforming the ubiquitous, traditional human-based system (CAHC) by encompassing document review into the internals of the computational system. The AI efforts of knowledge engineering and knowledge representation combined with statistics and social science techniques from the information (sometimes called library) sciences enable user modeling to be conducted and iteratively refined to provide a paradigmatic shift in the cost, time, and accuracy of legal e-discovery.
Figure 1: CAHC vs. HACC (note system boundary). (a) Computer-Aided Human Cognition. (b) Human-Aided Computer Cognition.
The extraordinary effectiveness of the Relevance Feedback (RF) paradigm is well established. Recent work [13] treating the information retrieval task as a form of classification has demonstrated that the most effective way to achieve high performance on a particular task is to acquire a large number of document assessments. How these assessments are acquired, however, is often left unspecified; in real-world, time-synchronous tasks, we cannot wait for assessments before addressing the task: such assessments, if they are to be used, must be created while addressing the task. Indeed, the performance of IR systems is seen to be a function not only of the inherent properties of the system, such as the algorithms used, but also of the information available as input to the system; the nature of the input information, including specifically the quality and quantity of such information, is a critical determinant of performance. That additional information can bring improved results has been recognized within the evaluation community for some time, as expressed through the existence of evaluation tasks such as the Interactive [6] and HARD [1] tracks at TREC. Such evaluations have sought to bring additional information to the information retrieval task in a controlled manner, limiting both the degree and the manner of information transfer. To some extent, such limitations were driven by the need to pose a controlled experimental paradigm wherein observed improvements could be reasonably attributed to the effect of the additional information. The results of such experiments, while showing that additional information does indeed help, are limited by the fact that the nature and amount of information was limited by the experimental conditions.
Although document review is typically delegated to more junior attorneys or out-sourced, it is ultimately the Lead Attorney whose notion of relevance is the determinant. As long as the interaction with this single user whose judgment defines relevant output is not limited to a single exchange, iterative exploration of the topic becomes possible, as explained in our analysis, below. By providing the IR system with a single target of relevance, modeling of user intention can be leveraged to transform review from a human task in Figure 1(a) to a machine task in Figure 1(b).
The key questions to be answered are therefore these:
• How can we most effectively harness the knowledge that the user makes available to the system in order to improve performance?
• Given limitations on the user’s time and attention, what is the best way to structure the conversation with the user so as to acquire the most information with the least effort?
• Given a certain amount of information, how best to go about the task of representing it in a way that is consumable by an automatic system?
• How can such a system deal with the real-world exigencies posed by operating in such an environment, including a fallible user whose interpretation is subject to change?
These are the questions with which this paper is concerned. We describe a hybrid human-computer symbiosis that addresses the problem of achieving high performance on IR tasks by systematically and replicably creating large numbers of document assessments, enabling a HACC AI system for e-discovery.
Figure 2: System Architecture
The internal architecture of our HACC system is illustrated in Figure 2. In our system, the Retrieval ←→ Review feedback loop is modeled by a combination of four separate, yet interconnected, processes: User Modeling, Assessment, Classification and Measurement. While ours is not the only architecture possible for a HACC system, we have found that the use of assessment as an intermediary relevance-bearing artifact serves as a useful means to represent relevance and convey it to all system components.
2.1 Proxy
The proxy is an internal agent who co-constructs a theory of relevance with the user via User Modeling. The proxy provides guidance to document assessors and resolves intra- and inter-assessor discrepancies to ensure that errors are resolved in favor of the proper interpretation of relevance.
2.2 User Modeling
User Modeling is the process by which the proxy co-determines a theory of relevance with the user, iterating the process to increase the likelihood of relevance within the system’s output.
2.3 Assessment
The assessment process is designed to (i) generate a large number of assessed documents (ii) of the appropriate kind (iii) with minimal error. The assessment process consists of an initial assessment of all documents of interest and subsequent error correction procedures.
2.4 Classification
Document-assessment pairs generated during assessment are used as training data for a supervised classification system. The classifier is trained over available assessments and the resulting model used to perform a binary classification of all documents.
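The paper does not name the classification algorithm it uses. As a minimal sketch of the general idea only, the following trains a binary classifier on (document, assessment) pairs and applies it to unseen documents. The choice of multinomial Naive Bayes, the whitespace tokenizer, the function names and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()

def train(pairs):
    """Fit a multinomial Naive Bayes model on (text, label) pairs.

    Labels are "R" (relevant) or "NR" (not relevant).
    """
    doc_counts = Counter()
    word_counts = {"R": Counter(), "NR": Counter()}
    for text, label in pairs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["R"]) | set(word_counts["NR"])
    return doc_counts, word_counts, vocab

def classify(model, text):
    """Return "R" or "NR" for a single document."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("R", "NR"):
        # Smoothed log prior, so an empty class never produces log(0).
        score = math.log((doc_counts[label] + 1) / (total_docs + 2))
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothed word likelihood.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return "R" if scores["R"] > scores["NR"] else "NR"

# Hypothetical assessments, invented for this sketch.
assessments = [
    ("in-store cigarette display promotion", "R"),
    ("point of sale marketing campaign for cigarettes", "R"),
    ("quarterly payroll and benefits summary", "NR"),
    ("laboratory equipment maintenance schedule", "NR"),
]
model = train(assessments)
corpus = ["retail point of sale display", "payroll summary"]
labels = [classify(model, doc) for doc in corpus]
```

Every document in the corpus receives a label, which mirrors the requirement that the model perform a binary classification of all documents rather than return a ranked list.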
2.5 Measurement
The performance of the classification system is regularly evaluated in order to test its efficacy. The classification system is run over all documents in the corpus. Following classification, a random sample is drawn and reviewed by document assessors. Data generated by the evaluation process are used to tune the system and may result in the proxy and user modifying the theory of relevance.
We take relevance to be both a user-derived as well as a system-constructed property whereby a user or system maps relevance to text. Thus, the effectiveness of an IR system can be measured by how well its relevance-to-text mapping intersects with the user’s relevance-to-text mapping. This approach entails a user and an information need: a text is deemed relevant by a user if it satisfies that user’s information need. Thus, at some level of an IR system, there must exist a representation of a user and his/her information need. In CAHC systems, this representation is often implicit or cursorily defined, as little attention is paid to the user’s information need as distinct from the user. In an HACC system, however, this distinction is best made explicit, and User Modeling (UM) serves as a powerful source of input by providing a mechanism by which external knowledge can be formalized into the system via query development, vocabu-
Footnote: i.e., (document, assessment) pairs, where assessment is R (relevant) or NR (not relevant).
Saracevic [11] describes five types of relevance: (i) system or algorithmic relevance, (ii) topical or subject relevance, (iii) cognitive relevance or pertinence, (iv) situational relevance or utility, (v) affective relevance. The UM framework we describe seeks to model and represent the user-relevance types: cognitive relevance/pertinence, situational relevance/utility and affective relevance. The information contained in the UM representation is used to evaluate the system-relevance types: system/algorithmic and topic/subject relevance. Moreover, the UM representation can also be modified by information surfaced via system-
UM is understood as a two-fold endeavor: (i) constructing a definition of relevance and (ii) iteratively interacting with a user to increase the likelihood of relevance in the output. We follow [12] in positing that mediated interaction, i.e., the interaction of a user, a human intermediary and an IR system, is the most effective form of UM in IR. Within such a model, an intermediary is an “intelligent agent constructing, implementing and modifying user models in all their complexity with considerable feedback” [12].
3.1 UM as co-construction
There are two central tenets of our approach to UM: (i) a user is seeking to resolve an “anomalous state of knowledge” and (ii) the user is unable to precisely specify what information is needed to resolve the anomalous knowledge-state [3]. These tenets underlie our own endeavors as intermediaries: we are seeking to resolve an anomalous state of knowledge as it pertains to satisfying the user’s information need and we are unable to precisely define what information will satisfy the user’s information need. Moreover, we recognize that users and intermediaries have access to external knowledge sources (personal knowledge, reference guides, the target corpus, etc.) that can be leveraged to inform and refine the model. Thus, the act of UM is a co-construction of information needs and mutual knowledge in a shared representation.
UM comprises four main areas: (i) use case, (ii) scope, (iii) nuance and (iv) linguistic variability. The resultant representation is a description of subject matter that, if found in a document, would make that document relevant (henceforth Subject Matter Model).
3.1.1 Use Case
Use case describes the high-level aspects of the user’s objectives, including meta-objectives, and allows us to balance those needs appropriately. While the stereotypical user need is to produce to opposing counsel a set of documents deemed responsive to the Request for Production (RFP), there may be other needs, such as to mitigate the risk of being accused of under-producing or over-producing. The decision to prioritize one risk over the other has far-reaching design implications: under-production > over-production implies a narrow, more exclusive conception of relevance whereas under-production < over-production implies a broad, more inclusive conception of relevance.
3.1.2 Scope
We define scope as the breadth of concepts considered relevant by the user. The goal of User Modeling for scope is to define the boundaries of relevance for a given conceptual
Footnote: The relationship of the intermediary to the user and the IR system is one of systems boundaries. Buckland and Plaunt [5] write that “systems boundaries define what is considered the ‘system’ and what is considered the ‘environment’”. On this definition, whether or not the intermediary is within the system is determined by how integrated the intermediary is into the design of the overall system.
Footnote: For more on co-construction of knowledge and mutual understanding, see [4] and [10].
3.1.3 Nuance
Nuance refers to the degree of specificity required to be relevant. Discussions of nuance and specificity typically center on the semantic relations of hyponymy and hypernymy. In some cases, a general representation of the concept, e.g. dog, may be considered relevant in addition to specific instances, e.g. dachshund. For other cases, only the more specific instance will suffice for relevance.
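The dog/dachshund distinction can be made concrete with a small sketch. It assumes a hand-built hyponym-to-hypernym map and a per-concept flag recording whether the user accepts more general terms. The map, the flag and the function names are hypothetical and are not taken from the system described here.

```python
# Hypothetical hyponym -> hypernym links for one concept family.
HYPERNYM = {"dachshund": "dog", "beagle": "dog", "dog": "animal"}

def ancestors(term):
    """Return the chain of hypernyms above `term`, nearest first."""
    chain = []
    while term in HYPERNYM:
        term = HYPERNYM[term]
        chain.append(term)
    return chain

def is_relevant(term, target, accept_general):
    """Decide whether `term` satisfies a model entry for `target`.

    The target itself and any of its hyponyms always count. A more
    general term counts only when the user has said that general
    mentions are acceptable (accept_general=True).
    """
    if term == target or target in ancestors(term):
        return True
    return accept_general and term in ancestors(target)

# The user cares about dachshunds; does a mention of "dog" count?
broad = is_relevant("dog", "dachshund", accept_general=True)
narrow = is_relevant("dog", "dachshund", accept_general=False)
```

As written, the flag admits every ancestor of the target, so "animal" would also count in the broad case; a real model would probably cap how far up the hierarchy a match may go.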
3.1.4 Linguistic Variability
Linguistic variability is related to, but distinct from, nuance. We define linguistic variability as the variety of ways a concept can be expressed, whether lexically or syntactically. Two approaches are common: defining each concept as a closed set or defining each concept in terms of pertinent
As assessed documents are one of the primary means by which relevance is defined in the system and conveyed between components, care must be taken to ensure that the assessments are of high quality and appropriately capture the full breadth of relevance. The presence of much irrelevant or unreliable noise can significantly reduce the ability of the system to generalize or, at best, increase the number of assessments needed to generalize properly. In this section, we describe the process we used to generate assessments.
The motivation for the process described here is to (i) generate a large number of assessments (ii) of the appropriate kind (iii) with minimal error. Assessed documents and associated annotations form the primary input to classification. The amount of information contained in these artifacts is determinative of a high quality result. Given the large number of relevant results in many discovery tasks, however, it is in general impossible to reliably assess more than a small fraction of the number of documents to be retrieved. Additional mechanisms must therefore be employed to actively determine likely sources of additional relevant documents with distinct language.
In order to ensure consistency between assessors, a portion of assessed documents are independently assessed a second time by another assessor. The resulting assessments are compared, and disagreements are resolved by the proxy. Bringing mismatches to the attention of the proxy, who was instrumental in the co-construction of the theory of relevance, ensures that systematic errors are resolved in favor of proper topic interpretation.
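The double-assessment step can be sketched as a comparison of two label sets followed by proxy adjudication. The sketch assumes assessments are stored as dictionaries from document id to "R" or "NR"; the ids, labels and function names are hypothetical.

```python
def find_disagreements(first_pass, second_pass):
    """Return ids of documents whose two independent assessments differ."""
    shared = first_pass.keys() & second_pass.keys()
    return sorted(d for d in shared if first_pass[d] != second_pass[d])

def agreement_rate(first_pass, second_pass):
    """Fraction of doubly assessed documents on which the assessors agree."""
    shared = first_pass.keys() & second_pass.keys()
    if not shared:
        return None  # nothing was assessed twice
    agree = sum(first_pass[d] == second_pass[d] for d in shared)
    return agree / len(shared)

def resolve(first_pass, proxy_decisions):
    """Apply the proxy's rulings on top of the initial assessments."""
    final = dict(first_pass)
    final.update(proxy_decisions)
    return final

# Hypothetical data: only a portion of documents gets a second assessment.
first = {"d1": "R", "d2": "NR", "d3": "R", "d4": "NR"}
second = {"d2": "R", "d3": "R", "d4": "NR"}

flagged = find_disagreements(first, second)   # sent to the proxy
rate = agreement_rate(first, second)
final = resolve(first, {"d2": "R"})           # proxy overturns d2
```

Tracking the agreement rate per assessor pair is one way the proxy could notice a systematic misreading of the topic, as opposed to isolated slips.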
4.1 Assessment Guide
The work of assessors is informed by the theory of relevance that the proxy has co-determined with the user. In order to communicate this intent, and to give added guidance to assessors in specific cases, assessment guidelines are drawn up by the proxy and communicated to and among the assessors. It has been shown, by e.g. [8], that annotator agreement can be enhanced by increasing amounts of detail in an annotation guide. The purpose of the assessment guide, then, is to provide detailed direction to assessors beyond that shared between the user and the proxy. To be sure, the guidance provided by the proxy is grounded in his or her understanding of the theory as shared with the user. The assessment guide, however, provides additional direction to the assessors on how to handle known and anticipated specific instances of the topic. The assessment guide is also maintained as a continuous record of decisions made about particular cases and the reasoning behind those decisions.
4.2 Assessment Process
The assessment process we use is designed to address the above goals while providing a straightforward and efficient workflow. The process consists of an initial assessment performed on all documents of interest, and subsequent error correction steps, performed on samples of the population with specific characteristics.
4.2.1 Initial Assessment
Assessors review documents drawn randomly using internal sampling procedures. Documents are assessed for relevance (R) or non-relevance (NR).
4.2.2 Relevant Passage Identification
Following initial assessment, a portion of the documents that have been assessed as R undergo a second round of assessment to identify relevant passages in the document. Relevant passages form one of the inputs of the classifier, where they serve to narrow the focus to highly relevant portions of potentially very long documents.
To extract relevant passages, assessors re-read R-assessed documents, and attempt to identify portions of the text that serve as indicators of relevance.
In addition to generating additional training information, passage extraction serves the secondary purpose of validating the initial assessment of relevant documents. Documents for which no relevant passage can be found are flagged for review by the proxy. Upon review, the proxy may either indicate the relevant passage, leaving the document as R, or overturn the R assessment in light of the lack of passage evidence, changing the document assessment to NR.
Although logically related to assessment, passage extraction is performed separately by an assessor other than the one who provided the initial assessment. This is done to ensure that passage extraction fulfills its function as a part of quality control, insofar as a portion of the relevant documents are assessed independently by more than one assessor.
4.2.3 Cross Check
Like R documents, documents with an initial assessment of NR must be quality checked via an independent second assessment. However, unlike R documents, no relevant passages can be expected in NR documents, and there is little marginal benefit to entertaining a distinct process. Therefore, a portion of NR documents are re-reviewed by a second assessor. Disagreements between the initial and second review are identified and flagged for review by the proxy. Upon review, the proxy may choose to leave the document as NR, or may overturn the initial assessment and make the document R.
4.2.4 Other Quality Controls
In addition to the Relevant Passage Identification and Cross Check procedures described above, which have been explicitly designed for quality control, improperly assessed documents are sometimes detected in other parts of the system. Although these ad-hoc controls individually contribute to only a small degree, taken together they form a third branch of quality control.
Because assessors differ in their capabilities, level of expertise and knowledge of the topic, additional quality control measures are employed on a per-assessor basis. The proxy therefore randomly selects documents that have been reviewed by each assessor for spot-checking until the proxy is confident of the assessor’s abilities.
4.3 Summary
We have described a number of mechanisms designed to enable a large number of assessments to be made in a methodical, efficient manner that minimizes errors and hews closely to the shared definition of relevance established by the user and system-internal proxy. The process also provides for evolution of interpretation as new exemplars are sought and identified.
Iterative approaches to information retrieval, such as relevance feedback, clearly offer benefits over a one-shot approach. Additional retrieval iterations provide the opportunity to uncover additional relevant documents or to refine judgments on previously identified documents, and can therefore potentially boost either Recall or Precision or both. However, in order to attain any advantage over a single-shot system, the iterative system must incorporate additional knowledge during the iteration process. Blindly incorporating additional information, with no attention paid to the current state of the system or the likely effect of such knowledge, is a blunt instrument that neither offers insight into the progress of the retrieval process nor provides direction concerning those next steps which may be most effective.
The alternative paradigm, which we espouse, incorporates explicit measurement of the system at different stages of processing. While measurement entails a certain amount of effort, the benefits are great. Among the primary benefits of measurement is the insight it provides to establish the current state of the system and the degree to which it has attained desired outcomes. In real-world applications, it is often possible to set minimum standards which will ensure that the information needs of the user are being met subject to other constraints. Measurement, therefore, determines not only the current state of the system, but also how many iterations must be performed in order to achieve the desired outcome.
In addition to providing insight into an iterative process, measurement also informs decisions made during execution of the process and provides the direction that is necessary to make considered changes in the approach. Thus, for example, if precision is seen to be low, additional effort can be expended to more carefully refine assessments to reduce erroneous R assessments. If, on the other hand, recall is low, additional efforts can be expended to find and assess additional relevant documents. Beyond the ordinary decisions regularly taken during exercise of a task, measurement can also be brought to bear to deal with extraordinary circumstances.
Measurement entails drawing random samples of classification output during certain stages of processing and performing assessments to determine the true status of documents in the sample.
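As a minimal sketch of sample-based measurement, the following draws a simple random sample from the documents a classifier marked relevant and turns the assessors' verdicts into a precision estimate with an interval. The normal-approximation interval, the fixed seed, and the function and variable names are assumptions of this sketch; the paper does not state which sampling design or interval estimator it uses.

```python
import math
import random

def draw_sample(doc_ids, size, seed=0):
    """Draw a simple random sample of document ids without replacement."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    return rng.sample(doc_ids, size)

def estimate_proportion(hits, n, z=1.96):
    """Point estimate and normal-approximation interval for a proportion.

    z=1.96 corresponds to an approximate 95% interval.
    """
    p = hits / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical classifier output: ids of documents labeled R.
classified_relevant = [f"doc{i}" for i in range(1000)]
sample = draw_sample(classified_relevant, 100)

# Suppose assessors judge 80 of the 100 sampled documents to be truly R.
precision, low, high = estimate_proportion(hits=80, n=100)
```

The same routine applied to a sample of documents labeled NR estimates how many relevant documents were missed, which is what a recall estimate needs.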
In this section, we present results from two evaluations of our HACC system. The first evaluation compares our results to those of a full manual review, the legal gold standard for document review. The second evaluation establishes results in the context of an information retrieval task, allowing comparison with fully-automated approaches.
6.1 Comparison with Manual Review
Kershaw’s [7] independent study of our HACC system began with a set of 48,000 documents (230,000 pages), which were to be coded for relevance to three responsive categories. The software was set up in accordance with the standard practices outlined above. These included interviewing the attorneys and reviewing documents to gain an understanding of the relevance criteria for the case and training the software accordingly. In parallel, six reviewers were trained to conduct a manual review of a stratified random sample of 43 percent of the corpus. The HACC system and the manual reviewers separately reviewed the documents and the results were compared.
Assessment was conducted by assuming that when HACC and the human reviewers agreed, the relevance determination was correct. Where there was a discrepancy (a document marked responsive by one approach and not by the other), the document was re-examined by the same reviewers to determine (in some cases with some debate and arbitration) who was correct, the HACC or the human reviewer. Crucially in this study, the same authorities (i.e., the human reviewers) made the final determination of the Computer Cognition assessments, eliminating differing interpretations as an explanation for differences in the final accuracy determination. After this cross checking review was finalized, the human reviewers were shocked at how many documents they missed and were similarly startled at how well the software achieved the objective of locating relevant documents. Across all three codes, the HACC system, on average, identified more than 95 percent of the relevant documents, with a high of 98.8 percent for one of the codes. The people, on the other hand, averaged 51.1 percent of the relevant documents, falling as low as 43 percent for one of the codes. For each relevant document missed by the HACC system, the control review process missed 32 documents; that is, the risk of failing to flag relevant documents for litigator review was 32 times greater under the traditional review process. In sum, the results of Kershaw’s study demonstrated that the use of a Classifier trained with a rigorous, iterative user-modeling process reduced the risk of missing a responsive document by 90 percent. Moreover, the effectiveness of this HACC electronic process improves as it is tweaked throughout the measurement, assessment, and quality assurance process.
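The 90 percent figure can be checked against the average recall values quoted above. The sketch below assumes that "risk of missing a responsive document" means one minus recall, and it uses 95 percent as a lower bound for the HACC average, since the text says only "more than 95 percent". The 32-to-1 figure is a ratio of document counts reported by the study and is not reproduced here.

```python
def miss_rate(recall):
    """Fraction of relevant documents that a review fails to identify."""
    return 1.0 - recall

def relative_risk_reduction(recall_system, recall_baseline):
    """Proportional drop in miss rate when moving from baseline to system."""
    return 1.0 - miss_rate(recall_system) / miss_rate(recall_baseline)

hacc_recall = 0.95     # lower bound: "more than 95 percent"
manual_recall = 0.511  # average for the manual reviewers

reduction = relative_risk_reduction(hacc_recall, manual_recall)
print(f"risk reduction: {reduction:.1%}")
```

With these inputs the reduction comes out just under 90 percent, consistent with the summary statement; any HACC average above 95 percent would push it higher.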
These findings make sense considering that document review work dependent on Human Cognition as the final determiner is extremely difficult, with people demonstrating subjective views of relevancy that vary both among individuals and, for each individual reviewer, over time due to fatigue or other causes. Because there is no stable, consistent user modeling of senior litigator knowledge, the traditional CAHC review process actually amplifies “noise” through the system introduced by accumulated discrepancies arising from the repeated, varying human interpretation and application of the senior litigator’s goals as captured in cryptic coding manuals. Double CAHC review (with a possible third, adjudication review in the case of a disagreement) just lengthens the process and increases cost to unacceptable levels without accuracy improvements beyond those obtainable by HACC. The AI process, on the other hand, consistently assesses every document with a rigorous transfer of senior litigator knowledge of relevance through user-modeling and machine replication and automation of this judgment.
6.2 TREC Legal Track Interactive Task
In order to provide a realistic, large-scale, replicable experimental context, we evaluated our system in the context of the TREC Legal Track Interactive Task. The Legal Track was established by TREC to evaluate approaches to information retrieval with application to the problems encountered in the legal world, including especially e-Discovery. The Interactive Task was formulated in order to provide a more realistic setting for the information retrieval task, and includes the following features [2]:
• A Topic Authority is the sole determiner of relevance
• Teams have 10 hours to interact with the Topic Authority
• Teams must submit a binary classification for every document in the population
• Teams can appeal assessment decisions, with the Topic Authority making the final decision
The Interactive Task was evaluated on the IIT Complex Document Information Processing (CDIP) Test Collection, comprising 6,910,192 documents released under the tobacco “Master Settlement Agreement”.
Given this context, we evaluated on the following topic:
Topic 103. All documents which describe, refer to, report on, or mention any “in-store,” “on-counter,” “point of sale,” or other retail marketing campaigns for cigarettes.
The process of executing this experiment included User Modeling to establish internal topic definition, Assessment to generate training data, Classification and Measurement.
6.2.1 User Modeling
In order to achieve the User Modeling goals of systematically accounting for and documenting Use Case, Scope, Nuance and Linguistic Variability, we employed a variety of means of interaction designed to elicit the desired aspects of relevance as efficiently as possible.
One of the reasons for performing User Modeling, as opposed to asking the user to directly perform assessment, is that it is a much more efficient use of the user’s time. To a certain extent, this reflects the realities of the legal context, in which the user is typically a lawyer with a relatively high hourly billable rate. Apart from this, however, User Modeling enables the user to draw on the combined efforts of an expert team of assessors while ensuring a level of quality that meets or exceeds that which the user would be able to
Thus, while the Interactive Task provided for 10 hours (600 minutes) of time to interact with the user, this must be taken in context. During this time, enough information must be exchanged between the user and the proxy so that the proxy can make the same decisions that the user would make. Any discrepancy between the user and the proxy imposes an upper limit on the ability of the system to correctly capture the user’s notion of relevance, and directly impacts the system’s performance. Considering the amount of detail that must be specified, 10 hours is conservative.
Figure 3: Example Questionnaire
In order to capture the user’s definition of relevance, we made use of several artifacts to efficiently guide the interaction between the proxy and the user, including:
Interviews: Interviews are conducted with the user to establish general topic interpretation and determine use case.
Questionnaires: In order to test scope and linguistic variability, lists of known variants drawn from the corpus were presented in the form of questionnaires, allowing the user to rapidly communicate the boundaries of his/her interpretation of certain concepts (cf. Figure 3).
Document Excerpts: In order to establish Nuance and in general determine the user’s opinion regarding actual documents, we presented the user with excerpts of documents (cf. Figure 4). Excerpts are used in preference to complete documents to restrict attention to a particular aspect of Nuance being tested.
Exemplar Documents: In a limited number of cases, entire documents are presented to the user in order to test a previously established interpretation.
6.2.2 Assessment
In order to develop a population of document assessments,
representative documents were identified and assessed for
aspects of the topic identified by User Modeling. In all, 7992
documents were identified and assessed, covering 18 aspects
of relevance. Although substantial, 7992 assessments represent
only 0.12% of the entire corpus that would need to be
reviewed in a purely manual review and 1.01% of the relevant
documents that would need to be post-reviewed for a
perfect CAHC system.
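The two percentages are simple ratios. The following is a minimal check, assuming the denominators are the corpus size of 6910192 documents and the TREC-estimated 786862 relevant documents reported in Section 6.2.3. The text does not state which relevance estimate was used, so the second denominator is an assumption.

```python
# Review effort as a share of the corpus and of the relevant set.
ASSESSED = 7992          # documents assessed during Assessment
CORPUS = 6910192         # documents in the whole population
RELEVANT_EST = 786862    # ASSUMED denominator: TREC-estimated relevant documents

def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole

share_of_corpus = percent(ASSESSED, CORPUS)
share_of_relevant = percent(ASSESSED, RELEVANT_EST)

print(f"{share_of_corpus:.2f}% of the corpus")
print(f"{share_of_relevant:.2f}% of the estimated relevant documents")
```

Under these assumptions the first ratio is about 0.12% and the second about 1.0%, in line with the figures quoted above.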
Figure 4: Example Excerpts
6.2.3 Measurement & Results
The assessments produced as described above were used
to train a classifier which, when applied to the 6910192
documents in the whole population, classified 608807 (8.81%)
as relevant. The quality of these relevance judgments was
evaluated via two independent measurements.
Prior to submission of results to TREC, a validation sample
of 300 documents was drawn to verify system performance.
The sample was single-reviewed for the same definition
of relevance used to generate the classifier. This sample
produced an estimated 729099 relevant documents (10.6%)
in the population, yielding an estimated precision of 82.3%
and recall of 68.7% (63.7, 73.8).
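The relationship between the figures in this paragraph can be checked with a short calculation. The sketch below is not the authors' estimation procedure. It only assumes that the number of true positives is approximately precision times the number of documents classified as relevant, and that recall is true positives divided by the estimated number of relevant documents in the population. The function and variable names are illustrative.

```python
def recall_from_precision(precision, n_classified_relevant, n_relevant_est):
    """Estimate recall from precision and population counts.

    true positives ~= precision * n_classified_relevant
    recall          = true positives / n_relevant_est
    """
    true_positives = precision * n_classified_relevant
    return true_positives / n_relevant_est

CLASSIFIED_RELEVANT = 608807   # documents the classifier marked relevant
RELEVANT_EST = 729099          # relevant documents estimated from the validation sample
PRECISION_EST = 0.823          # precision estimated from the validation sample

recall = recall_from_precision(PRECISION_EST, CLASSIFIED_RELEVANT, RELEVANT_EST)
print(f"estimated recall: {recall:.1%}")
```

Under these assumptions the result is about 68.7%, which agrees with the recall reported for the validation sample.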
Results were independently reviewed within the context
of the 2008 TREC Legal Interactive Task [9]. For the topic
in question, a sample of 6500 documents was drawn using
proportional stratified sampling from results determined to
be relevant and non-relevant by four teams and a pool of
ad hoc results. The 6500 documents were reviewed by law
students and cross-checked via an adjudication procedure
wherein disagreements were settled by the Topic Authority
who was responsible for relevance definition. This resulted
in an estimated yield of 786862 relevant documents (11.4%),
with estimated precision of 81.0% (79.5, 82.4) and estimated
recall of 62.4% (57.9, 66.8).
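As an illustration of proportional stratified sampling, the sketch below splits a fixed sample size across strata in proportion to stratum size. The strata names and sizes are hypothetical and chosen only so that they sum to the corpus size; the actual TREC strata and their sizes are not given in the text, and the rounding rule (largest remainder) is likewise an assumption.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to stratum size.

    Uses largest-remainder rounding so the allocations sum to sample_size.
    """
    total = sum(stratum_sizes.values())
    exact = {name: sample_size * size / total
             for name, size in stratum_sizes.items()}
    alloc = {name: int(share) for name, share in exact.items()}
    leftover = sample_size - sum(alloc.values())
    # Give the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - alloc[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc

# HYPOTHETICAL strata sizes, for illustration only (not from the paper).
strata = {
    "marked relevant by all teams": 50000,
    "marked relevant by some teams": 450000,
    "marked relevant by no team": 6410192,
}
allocation = proportional_allocation(strata, 6500)
for name, n in allocation.items():
    print(f"{name}: {n} sampled documents")
```

With proportional allocation, each stratum contributes to the sample at the same rate, so the largest stratum receives the largest share of the 6500 reviewed documents.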
The 95% confidence interval is shown in parentheses.
The alignment of internal and external measurements gives
increased confidence in the estimated values. Note, in
particular, that there is no significant difference between
the internal and external measurements.

A comparison of results with other TREC systems demonstrates
the benefit of the HACC design. The best CAHC
system achieved precision of 71.1% (69.2, 73.0) and recall of
15.8% (14.6, 16.9). This result is clearly an outcome of a
CAHC architecture: by employing humans to post-review
machine output, high precision is attained. However, by
failing to incorporate significant human interaction earlier
in the process, it is difficult to achieve high recall. Another
point of comparison is with a pool of 64 ad hoc runs, most of
which are non-manual, i.e., require no human involvement.
The recall of the pool is 40.3% (37.1, 43.4) while precision
is 38.2% (36.8, 39.6). As expected, this represents higher
recall, but lower precision, than the best individual entry in
the pool at 10.3% recall, 76.2% precision. It is notable that
the HACC system significantly exceeds both the precision
of the best individual entry and the recall of the overall pool.
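One rough way to read the reported figures is to check whether the 95% confidence intervals overlap. The sketch below is a heuristic, not the significance test used by the authors: non-overlapping intervals are a conservative indicator of a significant difference, while overlap by itself does not prove the absence of one. The type and function names are illustrative; the numbers are those reported in this section.

```python
from typing import NamedTuple

class Estimate(NamedTuple):
    value: float   # point estimate, in percent
    low: float     # lower bound of the 95% confidence interval
    high: float    # upper bound of the 95% confidence interval

def intervals_overlap(a: Estimate, b: Estimate) -> bool:
    """Return True when the two confidence intervals share any value."""
    return a.low <= b.high and b.low <= a.high

# Recall estimates reported in this section.
hacc_recall_internal = Estimate(68.7, 63.7, 73.8)
hacc_recall_external = Estimate(62.4, 57.9, 66.8)
best_cahc_recall = Estimate(15.8, 14.6, 16.9)
adhoc_pool_recall = Estimate(40.3, 37.1, 43.4)

# Precision estimates reported in this section.
hacc_precision_external = Estimate(81.0, 79.5, 82.4)
best_cahc_precision = Estimate(71.1, 69.2, 73.0)

comparisons = {
    "HACC internal vs external recall":
        intervals_overlap(hacc_recall_internal, hacc_recall_external),
    "HACC vs ad hoc pool recall":
        intervals_overlap(hacc_recall_external, adhoc_pool_recall),
    "HACC vs best CAHC recall":
        intervals_overlap(hacc_recall_external, best_cahc_recall),
    "HACC vs best CAHC precision":
        intervals_overlap(hacc_precision_external, best_cahc_precision),
}
for label, overlaps in comparisons.items():
    print(f"{label}: {'overlap' if overlaps else 'no overlap'}")
```

Under this heuristic, only the internal and external HACC recall intervals overlap, which is consistent with the comparisons drawn in the text.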
Results from the TREC experiments demonstrate that the
HACC system is able to attain high recall while maintaining
high precision. As illustrated by results across different
types of systems, achieving high precision is not difficult:
most systems that employ human review in some capacity
are able to leverage an innate human capability to generate
highly precise results. For recall, on the other hand, it
is a challenging task to make significant headway in either
manual or fully automatic systems. Indeed, apart from the
HACC system, the best recall was observed in an aggregation
of 64 individual results, and was still far lower than that
of the HACC system. Taken together, these results suggest
that the careful application of human interaction can result
in a system that achieves the precision of human review with
recall exceeding what has been previously observed in either
human or automated review.
We have presented a system for Information Retrieval
that engages in Human-Aided Computer Cognition to
systematically address the challenges of modern e-Discovery.
User Modeling, Document Assessment and Measurement are
combined to provide a shared understanding of relevance, a
means for representing that understanding to an automated
system, and a mechanism for iterating and correcting such
a system so as to converge on a desired result.
The resulting system is shown to exhibit exceptional
performance on difficult Legal AI tasks, exceeding that of both
full manual review and traditional Computer-Aided Human
Cognition information retrieval systems.
[1] J. Allan. Hard track overview in TREC 2003: High
accuracy retrieval from documents. In Proceedings of
TREC 2003, page 24, 2003.
[2] J. R. Baron, B. Hedin, D. W. Oard, and S. Tomlinson.
TREC-2008 legal track interactive task: guidelines.
Available online at: http://trec-
[3] N. Belkin. Anomalous states of knowledge as a basis
for information retrieval. Canadian Journal of
Information Science, 5:133–143, 1980.
[4] J. S. Brown. A Symbiotic Theory Formation System.
PhD thesis, University of Michigan, 1972.
[5] M. K. Buckland and C. Plaunt. On the construction of
selection systems. Library Hi Tech, 1994.
[6] W. Hersh and P. Over. TREC-8 interactive track report.
In Proceedings of TREC-8, page 57, 1998.
[7] A. Kershaw. Automated document review proves its
reliability. Digital Discovery & e-Evidence, November
[8] M. Maamouri, A. Bies, and S. Kulick. Enhanced
annotation and parsing of the Arabic Treebank. In 6th
International Conference on Computers and
[9] D. Oard, B. Hedin, S. Tomlinson, and J. Baron.
Overview of the TREC 2008 legal track. In
Proceedings of The Seventeenth Text REtrieval
Conference (TREC-2008), 2008.
[10] J. Roschelle and S. Teasley. The construction of
shared knowledge in collaborative problem solving. In
C. O’Malley, editor, Computer-supported collaborative
learning, pages 69–77. Springer-Verlag, Heidelberg,
[11] T. Saracevic. Relevance: A review of the literature
and a framework for thinking on the notion in
information science. Part II: nature and manifestations
of relevance. Journal of the American Society for
Information Science and Technology, 58(3):1915–1933,
[12] T. Saracevic, A. Spink, and M.-M. Wu. Users and
intermediaries in information retrieval: What are they
talking about? In User modeling. Proceedings of the
Sixth International Conference, UM97, pages 43–54,
New York, 1997. Springer.
[13] Y. Zhu, L. Zhao, J. Callan, and J. Carbonell.
Structured queries for legal search. In Proceedings of