4. Preliminary Studies & Progress Report


In this section we review our recent work on EM signal analysis, pattern classification, and data mining (Aim 1); head modeling and source localization (Aim 2); ontology development and database modeling (Aim 3); research on reading-related EM patterns (Aim 4); semantic mapping and ontology-based integration (Aim 5); and design and implementation of a web-based portal to provide shared access to NEMO data and tools (Aim 6). In the past 6 months, we have had two paper acceptances (Dou, et al., 2007; Rong, et al., 2007). Our paper in KDD2007, the top data mining conference, described the design and implementation of first-generation ERP ontologies. We also have one paper in revision (Frishkoff, et al., 2007) and two under review (LePendu, et al., 2007; Hansen-Smith, et al., 2007). These publications show significant progress on Aim 1 (pattern classification, data mining), Aim 3 (ontology development, relational modeling), and Aim 5 (design and implementation of a web-based portal).

Progress on Specific Aim 1: Pattern classification and labeling in sensor space

In the past year, our neuroscience and computational researchers have collaborated to make rapid progress on development and evaluation of an automated system for pattern classification and labeling. Our work in this area is summarized in three recent papers (Dou, et al., 2007; Rong, et al., 2007; Frishkoff, et al., in revision).

Development of an automated system for pattern classification and labeling

As described in Section B, there are alternative methods for transforming sensor-referenced ERP and ERF data into discrete patterns for labeling. Traditional methods include identifying the time of peak amplitude within a specified time window (“peak picking”) and computing mean amplitude over a time interval for each electrode (“windowed analysis”), or electrode cluster (“region of interest,” or ROI). The typical practice is for researchers to inspect and label data that have been averaged across subjects, and then to extract summary measures, without verifying that labeled patterns meet objective criteria (rules) for a particular pattern of interest. This procedure can be problematic, as selection of temporal windows and channel sets is often subjective, and criteria for pattern identification vary across research labs, making it hard to compare different results. Moreover, with traditional methods, overlapping patterns may be confounded during a given time window (e.g., Frishkoff, 2007). By contrast, our goal is to develop an objective process for classifying and labeling data at the individual subject level. Only verifiable instances of a pattern are to be assigned pattern labels, and the focus is on characterization and labeling of pattern instances in each individual subject and experiment condition.
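As an illustration only, the traditional summary measures named above (peak picking, windowed analysis, and ROI averaging) can be sketched in a few lines of Python. The function names, channel names, time window, and waveform values below are hypothetical and are not taken from any NEMO dataset or toolbox.

```python
def peak_pick(samples, times_ms, window):
    """Return (latency_ms, amplitude) of the largest absolute amplitude in window."""
    lo, hi = window
    candidates = [(abs(v), t, v) for v, t in zip(samples, times_ms) if lo <= t <= hi]
    if not candidates:
        raise ValueError("no samples fall inside the window")
    _, latency, amplitude = max(candidates)
    return latency, amplitude


def windowed_mean(samples, times_ms, window):
    """Return the mean amplitude of one channel over the time window."""
    lo, hi = window
    values = [v for v, t in zip(samples, times_ms) if lo <= t <= hi]
    if not values:
        raise ValueError("no samples fall inside the window")
    return sum(values) / len(values)


def roi_mean(channel_values, roi):
    """Average per-channel values over the channels listed in roi."""
    return sum(channel_values[ch] for ch in roi) / len(roi)


# Toy waveforms (microvolts) sampled every 50 ms; all values are made up.
times_ms = [0, 50, 100, 150, 200]
oz = [0.0, 1.0, 3.0, 1.5, -0.5]
o1 = [0.0, 0.5, 2.0, 1.0, -0.5]

window = (80, 150)  # hypothetical analysis window in ms
latency, amplitude = peak_pick(oz, times_ms, window)
means = {"Oz": windowed_mean(oz, times_ms, window),
         "O1": windowed_mean(o1, times_ms, window)}
occipital_mean = roi_mean(means, ["Oz", "O1"])
```

The choice of `window` and of the ROI channel list is exactly the kind of analyst decision that the paragraph above describes as subjective.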

To support large-scale meta-analyses of labeled patterns across research labs, experiment paradigms, and imaging modalities, it is important that these methods be automated. To this end, we have identified several methods for automated segmentation of data into discrete spatiotemporal patterns for labeling: e.g., Factor Analytic methods (e.g., PCA), Microstate Analysis, and Segmented Regression. For our initial studies, PCA offered several advantages. First, PCA is able to separate spatially and temporally overlapping patterns, which are confounded in traditional measures. Second, PCA automatically extracts a discrete set of patterns and is therefore easily inserted into an automated process for pattern extraction and classification. In this section we describe results from application of tPCA-based pattern analysis to several high-density ERP datasets.
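The core idea of temporal PCA can be shown with a small, self-contained Python sketch: time points are treated as variables, individual waveforms as observations, and the dominant eigenvector of the time-by-time covariance matrix gives a latent time course. This is not the Dien PCA Toolbox; it extracts only the first component by power iteration and omits the factor retention and rotation steps that an ERP PCA pipeline would normally include. The data and function names are invented for illustration.

```python
import math


def covariance(rows):
    """Covariance between columns (time points) across rows (observations)."""
    n, p = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_component(cov, iterations=200):
    """Dominant eigenvector of cov (unit length), found by power iteration."""
    p = len(cov)
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return vec


# Four toy observations: one temporal template scaled by different amplitudes.
template = [0.0, 1.0, 3.0, 1.0, 0.0]
amplitudes = [1.0, 2.0, -1.0, 0.5]
waveforms = [[a * t for t in template] for a in amplitudes]

loadings = first_component(covariance(waveforms))
peak_sample = max(range(len(loadings)), key=lambda i: abs(loadings[i]))
scores = [sum(w * l for w, l in zip(row, loadings)) for row in waveforms]
```

In this toy case the loadings recover the shape of the template (peaking at the third sample), and the factor scores track the per-observation amplitudes.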

PCA methods. PCA transforms ERP data into a small set of latent temporal patterns (Dien & Frishkoff, 2005). Each of the resulting temporal patterns is associated with a scalp topography, which is invariant across time. For example, Figure 1 (C) shows the time course associated with the fourth principal component extracted from an ERP study of word/nonword processing. This component is characterized as a phasic peak in activity at ~100ms after presentation of a stimulus. Figure 1 (B) shows the scalp projection for this component, which is indexed by a positivity that is maximal over occipital areas. Based on the time course and spatial distribution of this PCA pattern, experts can easily recognize it as an instance of the “visual P100” component. PCA is therefore easily inserted into an automated process for pattern extraction and classification. In a series of initial studies for this project, we applied temporal PCA to several ERP datasets, using the Dien PCA Toolbox, written for MATLAB (Dien, 2004). Based on prior research (Frishkoff, et al., 2004; Frishkoff, 2006), we defined seven spatiotemporal patterns that are commonly observed in ERP studies of visual word recognition (listed in Fig. 1(A)). Next we defined statistical measures to capture spatial and temporal characteristics of the latent (PCA) factors for each subject and experimental condition. Rules for each pattern were operationalized using these attributes. Given these pattern definitions, we developed an automated procedure, implemented in MATLAB, for applying pattern rules to the metrics extracted for each of 534 observations (individual subject ERPs). After application to test data, we evaluated the results and modified the pattern rules. For example, after three test iterations, the visual “P100” pattern (P100v) was defined as follows:

(*) For any n, FA_n = P100v if:

    80ms < TI-max(FA_n) ≤ 150ms        Temporal criterion
    IN-mean(ROI) > 0                   Spatial criterion
    EVENT(FA_n) = stimon               Functional criterion #1
    …                                  Functional criterion #2

where FA_n is defined as the nth PCA factor, TI-max is peak latency, IN-mean(ROI) is mean amplitude over a region-of-interest (ROI), and ROI for P100v is “occipital” (i.e., mean intensity over occipital electrodes).
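A rule of this kind translates directly into a predicate over the metrics extracted for each factor. The sketch below is a hypothetical Python rendering (the procedure described above was implemented in MATLAB); the class name, field names, and the event code string are assumptions, and the second functional criterion is left out because its content is not given in the text.

```python
from dataclasses import dataclass


@dataclass
class FactorMetrics:
    """Hypothetical container for the metrics extracted from one PCA factor."""
    ti_max_ms: float    # peak latency (TI-max)
    in_mean_roi: float  # mean amplitude over the occipital ROI (IN-mean(ROI))
    event: str          # event to which the factor is time-locked


def is_p100v(m: FactorMetrics) -> bool:
    """Apply the temporal, spatial, and first functional criterion for P100v."""
    temporal = 80 < m.ti_max_ms <= 150   # strictly after 80 ms, up to and including 150 ms
    spatial = m.in_mean_roi > 0          # positivity over the occipital ROI
    functional = m.event == "stimon"     # assumed code for stimulus onset
    return temporal and spatial and functional


observations = [
    FactorMetrics(104.0, 1.2, "stimon"),
    FactorMetrics(80.0, 1.2, "stimon"),
    FactorMetrics(104.0, -0.4, "stimon"),
]
labels = ["P100v" if is_p100v(m) else "no match" for m in observations]
```

Expressing each criterion as a named boolean makes it easy to report which criterion failed when a factor does not match, which is useful when rules are revised over several test iterations.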

Figure 1 shows the results of applying the PCA-based autolabeling procedure to one dataset, acquired from 36 subjects in 6 experimental conditions (total #observations = 216). The rules for 8 patterns of interest (see Figure 1 (A): column labels) were applied to each observation. The summary table in Figure 1 (A) shows the percentage of observations that met rule criteria for the first 7 factors. Results for the P1 and N1 were encouraging: ~83% of the observations showed a match to the pattern rules, and there was no pattern splitting (i.e., if the P1 was detected, it was always found in Factor 4). Further, we found good agreement between expert ratings and results from autolabeling. Mean confidence was lower when there was a mismatch between the expert and auto-labeling results (2.71 vs. 1.81; p<.001), suggesting that probabilistic ratings may be a useful way to capture variation that is not accounted for by all-or-none judgments (see Sec. 5).

Later patterns were detected in a smaller percentage of test subjects and either split across multiple factors (e.g., the MFN was assigned to Factors 2 and 7), or were clustered together, rather than mapping to distinct factors (i.e., the N2, N3, and P1r). These results may reflect a misallocation of pattern variance across PCA factors, latency jitter of components across subjects and conditions, or inaccurate pattern rules. One implication is that the robustness of our pattern classification methods is most strongly challenged when we examine later patterns (with peak latency >250ms), where there is more variability in timing and spatial distribution of patterns across trials, participants, and experiment conditions. In addition, these findings point to the need for a systematic evaluation of component analysis methods, pattern concepts, and rule definitions.

In light of these findings, we have focused recent efforts on testing and evaluation of pattern classification methods (see Frishkoff, et al., 2007 for details). Because there are well-known limitations of PCA methods, we have been researching alternative component analysis methods, including Microstate Analysis and Segmented Regression (see Sec. 5 for details). In addition, we are testing bottom-up (data-driven) procedures for validation and refinement of pattern analysis and classification methods, as described below.

Application of bottom-up (data-driven) methods for validation and refinement of pattern rules

Data mining provides a principled way to validate and refine the rules defined by domain experts, and can help improve the accuracy and robustness of our pattern classification system. In the present context, we applied data mining in two steps: 1) to verify pre-defined ERP patterns or discover new ones, and 2) to generate rules defining patterns that a domain expert could understand and help to evaluate.

Figure 1. Auto-classification and labeling results from Frishkoff, et al. (2007). (A) % observations matching criteria for each pattern. (B) Topography and (C) time course of pattern factors.


Clustering involves the partitioning of data into subsets, or clusters, such that the data in each cluster share some common features. For our initial studies, we used the Expectation Maximization algorithm (Dempster et al., 1977) for clustering analysis, to verify pre-defined ERP patterns. Input to the algorithm consisted of 32 metrics for each factor, weighted across each of the 144 labeled observations (total N=4,608). Pattern labels for each observation were a combination of the autolabeling results (pattern present vs. pattern absent for each factor, for each observation), combined with typicality ratings as follows. Observations that met the rule criteria (“pattern present” according to autolabeling procedures) and were rated as “typical” (rating …) were assigned to one category label. Observations that failed to meet pattern criteria (“pattern absent”), or were rated as atypical (‘1’ on rating scale), or both, were assigned to a second category. The combined labels were used to capitalize on the high reliability and greater sensitivity of the typicality + presence/absence ratings, as compared with the presence/absence labels by themselves (see Rong, et al., 2007 for details).

For the Expectation Maximization procedures, we set the number of clusters to be 9 (8 patterns + nonpattern). We then clustered the 144 observations derived from the pattern factors, based on the 32 metrics. The assignment of observations to each of the 9 clusters largely agreed with the results from the top-down (autolabeling) procedures. Ideally, each cluster will correspond to a unique ERP pattern. However, as shown in Table 1, inaccuracies in either the data summary (PCA) procedures, or the expert rules, or both, can lead to pattern splitting. Thus, it is not surprising that patterns in our clustering analysis were occasionally assigned to two or more clusters. For instance, the P100 pattern split into two clusters (Clusters 4 and 5), consistent with the autolabeling results.
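To make the clustering step concrete, the following is a minimal Expectation Maximization sketch for a Gaussian mixture, written in plain Python. It is deliberately reduced to one metric (a made-up peak latency) and two clusters, whereas the analysis above used 32 metrics and 9 clusters; the function names, starting means, and data are assumptions for illustration only.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, init_means, iterations=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM; returns (means, variances, weights, labels)."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iterations):
        # E-step: responsibility of each cluster for each observation.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, variances, weights, labels


# Made-up peak latencies (ms) drawn from two well-separated groups.
latencies = [98.0, 102.0, 105.0, 110.0, 168.0, 172.0, 175.0, 180.0]
means, variances, weights, labels = em_gmm_1d(latencies, init_means=[100.0, 170.0])
```

Cross-tabulating `labels` against the rule-based pattern labels is one simple way to detect the pattern splitting discussed above, since a split pattern shows up as one label spread over two or more clusters.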

Decision trees are predictive models, mapping the attributes of data points to inferences about their class labels. In two published papers, we describe application of decision tree learners (Weka J48, C4.5 algorithm) (Quinlan, 1993) to the attribute vectors of data that had been labeled by the clustering process. The rules that were generated can be compared to rule criteria established by domain experts to discover which spatial, temporal and functional attributes are most important in defining known ERP patterns (see Rong, et al., 2007 and Dou, et al., 2007 for details). Observations in each cluster can be labeled with cluster names without considering the experts' labels. Once the clustering process becomes more robust, this will obviate the need for manual labeling, providing a considerable savings in time and a gain in information processing.

One advantage of using decision trees is that we can generate rules automatically and use these results to extend and refine the rules generated by domain experts. For example, results from (Rong, et al. 2007) show that rules that were discovered through data mining largely agreed with rules generated by domain experts (e.g., for the P100v, see (*) above). In addition, cluster-based rules pointed to the importance of refining spatial metrics in the auto-generated rules. In particular, Expectation Maximization clustering results suggested that the use of a spatial template can be useful for ERP pattern detection (Frank & Frishkoff, 2007).
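The way a decision tree learner turns labeled observations into a threshold rule can be illustrated with a single split chosen by information gain. This is a simplified stand-in, not Weka J48: C4.5 uses gain ratio, handles many attributes, and prunes the tree, none of which is reproduced here. The latencies and labels are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best split 'value <= threshold'."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (None, 0.0)
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue  # no boundary between identical values
        thr = (a + b) / 2.0
        left = [lab for v, lab in pairs if v <= thr]
        right = [lab for v, lab in pairs if v > thr]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (thr, gain)
    return best


# Made-up peak latencies (ms) with cluster-derived labels.
latencies = [95.0, 104.0, 110.0, 120.0, 170.0, 185.0, 200.0, 230.0]
cluster_labels = ["P100v"] * 4 + ["other"] * 4

threshold, gain = best_threshold(latencies, cluster_labels)
rule = f"IF TI-max <= {threshold} THEN P100v"
```

A learned threshold of this form can then be set beside the expert-defined latency window to see where the data-driven and expert rules agree or diverge.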

Progress on Specific Aim 2: Analyses of EM patterns in source (anatomical) space

As described in Section 2, the use of dense-array EM measures has led to major advances in characterization of spatiotemporal EM patterns. In fact, topographic analysis is necessary for accurate description of EM patterns: since multiple brain areas may be active at any one time, inspection of time series provides a limited view of the component processes that are active within a given time interval (Frishkoff, et al., 2004; Tucker, et al., 1993). On the flip side, multichannel data lead to more complexity and greater variability in methods for data representation and analysis (see Section 3, Challenges). This complexity is alleviated, to some degree, if scalp topographic patterns can be mapped to the cortical surface for each individual subject, since variability in brain geometry leads to superficial (rather than meaningful) differences in scalp topography (Scherg, et al., 2002). Moreover, once the data have been mapped to the cortex, spatial representation becomes much simpler than in scalp space, as researchers can refer to coordinate systems that are commonly used for representation of brain imaging data (e.g., Brodmann Areas, Talairach coordinates), and differences in sensor layouts are no longer a source of variance across datasets. Ultimately, analysis of patterns in source space can help establish a “ground truth” in identification of EEG and MEG pattern rules and concepts, and can provide a foundation for integration of EM patterns with results from other modalities, such as PET and fMRI.

Table 1. Clustering results (NP, non-pattern).

Despite these many advantages, mapping between sensor-level and source patterns is a difficult problem, and a variety of solutions have been proposed (see Michel, et al., 2004 for a review). The difficulty with source localization is summarized by the “inverse problem”: EEG and MEG are limited by the ill-posedness of the electromagnetic inverse problem (a consequence of Maxwell's equations), which means that there is no unique way to determine the current distribution in the brain from surface measurements. Therefore, modeling approaches to the inverse problem are required, and a number of techniques have been developed that are effective in different contexts (Michel, et al., 2004). These techniques reflect different choices with respect to mathematical and anatomical assumptions and constraints on forward models and on inverse solutions. In addition, high-precision source solutions depend on the use of accurate head models and precise sensor registration (Michel, et al., 2004). In this section, we describe our prior work in each of these areas.

Head Modeling. Measurement and representation of head geometry, tissue segmentation, and biophysics are important factors that affect source modeling accuracy and complexity. Our work in the NIC has focused on building full-geometry, full-physics head models for individual subjects. We have developed algorithms for MRI tissue segmentation (Li, et al., 2006a) and full-resolution, finite difference models (FDM) that can estimate human head electromagnetic dynamics (Salman, et al., 2005). Our conductivity modeling (Salman, et al., 2006; Turovets, et al., 2006) uses electrical impedance tomography and multi-search strategies to determine impedance values for up to 14 tissues (Salman, et al., 2007). To build accurate linear inverse solutions, we have also created advanced cortex extraction techniques (Li, 2007) that use morphological knowledge for topology correction (Li, et al., 2006b).

Sensor registration. We have also developed a method for sensor localization that is fast, accurate, and easy to use (Russell et al., 2005). Multiple cameras are arranged in a geodesic array, and images of the sensors on the subject’s head are acquired (requiring ~1 min.), allowing for reconstruction of 3D sensor positions. The accuracy of the photogrammetric method, quantified as the RMS difference between the measured positions and the actual positions, is similar (mean error 1.27 mm) to that of the standard electromagnetic method (mean error 1.02 mm). These results show that photogrammetry can lead to fast and accurate registration of sensor positions.
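For reference, an RMS position error of the kind quoted above can be computed as follows. This is a generic sketch with invented coordinates; it does not reproduce the evaluation protocol of Russell et al. (2005).

```python
import math


def rms_position_error(measured, actual):
    """Root-mean-square Euclidean distance between paired 3D positions (same units)."""
    if len(measured) != len(actual) or not measured:
        raise ValueError("need two non-empty position lists of equal length")
    squared = [sum((m - a) ** 2 for m, a in zip(p, q))
               for p, q in zip(measured, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Invented sensor coordinates in millimetres.
actual_mm = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
measured_mm = [(1.0, 0.0, 0.0), (10.0, 0.0, -1.0), (0.0, 11.0, 0.0)]

error_mm = rms_position_error(measured_mm, actual_mm)
```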

Spatiotemporal dipole analysis. In prior work, we have adopted an approach that is well-suited to the representation of data in source space. Rather than conducting source analysis of an effect after it has been identified in the scalp data (a widespread current practice), we identify a “source montage” of cortical regions such that we are able to construct a complete representation of spatiotemporal patterns in source space. This approach was pioneered by Michael Scherg in the Brain Electrical Source Analysis (BESA) software (Scherg, et al., 2002) and has been implemented in our laboratory (Frishkoff, et al., 2004; Frishkoff, 2007). One advantage of this approach is that it does not rely on identification of experimental effects via difference measures (subtraction of brain activity in two conditions). Subtraction can lead to misinterpretation (or “misallocation”) of pattern variance (Frishkoff, et al., 2004). A corollary advantage is the ability to describe the entire network of brain dynamic activity that gives rise to spatiotemporal patterns which are superposed on the scalp surface.

Distributed linear inverse solutions. When implemented with a cortical distributed linear inverse, the statistical reliability of the source solution can be examined with voxel-wise methods such as Statistical Parametric Mapping (SPM), which was designed for analysis of voxel effects in PET and fMRI. In ongoing work, we are validating the source estimates from the 256-channel scalp data in relation to intracranial recordings of seizure onset in epileptic patients undergoing long term monitoring for neurosurgical resection of the epileptic brain tissue (Holmes, et al., 2004; in press; Tucker, et al., in press). Results so far show that the distributed linear inverse solutions are accurate in relation to intracranial recordings when artifacts in the data are minimized, when the appropriate (moderate) regularization is used, and when the forward model is computed with the accurate (FDM) resistivity model of head tissues. A key advantage of the distributed inverse solution over the equivalent dipole method is that there is typically electrical activity in multiple brain regions, even though it may not appear to be of interest (in relation to a seizure or a language process). If this distributed activity is not accurately modeled, it will alias onto the few sources that are used for the solution. In addition, for the purposes of a co…-based ontology, the distributed linear inverse approach provides estimation of activity in all cortical regions, such that classification can proceed in a manner directly analogous to the statistical analysis and database referencing that is now well described in the PET and fMRI literature.
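The role of regularization in a distributed linear inverse can be seen in a toy example. The sketch below implements a generic Tikhonov-regularized minimum-norm estimate, j = Lᵀ(LLᵀ + λI)⁻¹v, in plain Python; it is offered as one common form of distributed linear inverse and is not claimed to be the specific inverse used in the work described above. The 2-sensor, 3-source lead field `L`, the data `v`, and the values of λ are invented.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; increase the regularization")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def minimum_norm(leadfield, v, lam):
    """Regularized minimum-norm source estimate j = L^T (L L^T + lam I)^-1 v."""
    n_sens, n_src = len(leadfield), len(leadfield[0])
    gram = [[sum(leadfield[i][k] * leadfield[j][k] for k in range(n_src))
             + (lam if i == j else 0.0) for j in range(n_sens)]
            for i in range(n_sens)]
    w = solve(gram, v)
    return [sum(leadfield[i][k] * w[i] for i in range(n_sens)) for k in range(n_src)]


def forward(leadfield, j):
    """Sensor values predicted by source amplitudes j."""
    return [sum(l * s for l, s in zip(row, j)) for row in leadfield]


L = [[1.0, 0.0, 1.0],   # sensor 1 sees sources 1 and 3
     [0.0, 1.0, 1.0]]   # sensor 2 sees sources 2 and 3
v = [1.0, 1.0]

j_unregularized = minimum_norm(L, v, lam=0.0)  # fits the data exactly
j_regularized = minimum_norm(L, v, lam=1.0)    # smaller amplitudes, imperfect fit
```

With more sources than sensors, many source configurations reproduce `v` exactly; the minimum-norm criterion selects the one with the smallest overall amplitude, and increasing `lam` trades data fit for stability against noise.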

MEG localization rests on similar principles as EEG source mapping. For both modalities, source localization requires a complete description of the “forward physics,” describing how neuronal currents lead to electromagnetic fields detected at the sensors. As shown in Figure 2, EEG and MEG measures are exactly complementary (right-hand rule), providing synergistic insights on the distribution of EM patterns and their underlying source generators. There are, however, important differences between the modalities. The forward model for MEG must include the effect of the superconducting surface on the fields generated by primary sources, as well as fields generated by Meissner currents in the SIS. Similarly, accurate source localization of EEG requires good estimates of conductivity across regions of the skull and scalp (Salman, et al., 2006; Turovets, et al., 2006).

Dr. Frishkoff is collaborating with Dr. Bagic (director of the CABMSI at Pittsburgh) on a pilot study to acquire whole-head EEG and MEG data simultaneously, using auditory and visual target detection paradigms. These experiments are designed to support high-resolution source solutions, and to calibrate methods for analyses of later, reading-related patterns (Aim 4).

Progress on Specific Aim 3: Development of EM ontologies and ontology-based data modeling

We have already built a first-generation ERP ontology and a prototype tool for ontology development (OntologyMiner), as described below and in our recent papers (Dou, et al., 2007; Rong, et al., 2007).

Ontology Development as Ontology Mining: Ontology development is an interactive process involving ontology engineers and domain experts. We have used data mining results as the main references to define an ERP ontology. The four data mining procedures and their interactions are shown in Figure 3, and outputs (e.g., classes, class hierarchies, etc.) are included in an ERP ontology. This is a semi-automatic framework because we need expert input to assign meaningful names to the …
Our ontology mining framework is based on the assumption that there exists a domain ontology (i.e., semantics) for a set of data instances in some specific domain (e.g., ERPs from psycholinguistic studies). Our goal is to find what classes, properties and axioms can be mined to compose that domain ontology. From a machine learning point of view, the domain ontology is the target function to be learned, and it includes components such as classes, class hierarchies, properties, and axioms. A reasonable assumption is that data instances that belong to the same class are similar by virtue of sharing certain properties, and (conversely) that data instances belonging to different classes are dissimilar. Determination of the classes to be included in an ontology can therefore be considered as a clustering problem. Discovering the hierarchy of classes (clusters) is a hierarchical clustering problem. On the other hand, deciding which properties and values are shared by data instances in the same class is a typical classification problem. The selection of attributes for classification (e.g., information gain selection) can be used for property selection. The classification rules can also be treated as the relationships (axioms) of properties and classes. The association rules between different properties can be treated as relationships (axioms) between these properties, which are also used for rule (axiom) optimization.

Figure 2. (a) Butterfly plots of 256 EEG and 148 MEG channels show the somatosensory evoked potential. (b) Topographies at 20 ms show the electrical field inversion over sensorimotor cortex (top) and the corresponding right-hand rule magnetic field (bottom). (c) Single equivalent dipole source for EEG was in the central sulcus (top) and for MEG was posterior to the postcentral gyrus.

Figure 3. A framework for mining domain ontologies.

The combination of top-down (expert-driven) and bottom-up (data-mining) approaches has led to a first-generation ERP ontology, which consists of 16 classes, 57 properties and 23 axioms (Fig. 4 is a partial view of the ontology). Figure 4 illustrates five basic classes, i.e. factor, pattern, channel group, topography and measurement. Pattern “objects” have temporal, spatial and functional attributes (e.g., EVENT, SP-cor, MOD, etc.), which are represented as properties of the pattern class. TI-max and IN-mean(ROI) are properties of measurements (e.g., time sample, voltage), which have both unit and value properties. The pattern class has 8 subclasses, which correspond to 8 ERP patterns defined by domain experts and verified through data mining. Properties of the pattern class are those used in expert rules or rules discovered by data mining. The expert rules are represented as Horn rules whose body consists of conjunctions of predicates. The relationship between factors and patterns can be modeled using the “occursIn” property. Each pattern has a region of interest (ROI), which is a channel group belonging to the topography class. For instance, left occipital (LOCC) and right occipital (ROCC) are sub-classes of channels. The intensity mean for a region of interest is calculated based on this relationship.

In addition to graphic representation, we represent our EM ontology internally using formal ontology languages. Web Ontology Language (OWL) works well to specify classes, class taxonomy and properties in our ontology, and for axioms (such as classification rules), we can use SWRL (http://www.w3.org/Submission/SWRL/). After some further revisions, we will submit our EM ontology to the National Center for Biomedical Ontology.

Ontology-based Data Modeling. EM ontologies can be used to answer semantic queries. To this end, we have designed an ontology database for the ontology represented in Figure 4. We are particularly interested in several properties: correctness, space, time, and query answering speed. We used the data from a visual word study, in which there were 34 human subjects, 25 different dimensions in the attribute space and a vector of 1152 component factors. In essence, we were working with a matrix of data consisting of 1152 rows by 34 columns. For every class in the ontology, we defined a unary relation, and for every property a binary relation. For every dependency (ID, TGD, or EGD) inferred from the ontology specification, we generated the corresponding foreign keys, triggers, and primary keys in the database. Finally, for every data instance, we generated a unique internal identifier. Altogether, the matrix consists of 100,425 individual facts (ground terms).

Figure 4. A partial view of our first-generation ERP ontology.

The ontology database experiments were performed on a personal laptop computer with a 1.8GHz Centrino processor and 1GB of RAM running MySQL 5.0 as the RDBMS. It took approximately 14 seconds to generate the database schema based on the ontology and load it into the MySQL RDBMS. It took 1.3 hours to load all the individual facts. We have tested our implemented ERP ontology database with queries defined collaboratively by our computer science and neuroscience researchers. Queries such as the following were answered successfully by our ERP ontology database: Show all patterns whose region of interest is left-occipital and whose peak latency is between 220 and 300ms. To our knowledge, there is no other database design that can automatically answer queries of this sort. Our system may be the first to utilize ontologies and databases to successfully answer meaningful scientific questions for electromagnetic data.
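The mapping from ontology to relations, and the example query above, can be sketched with an in-memory database. The sketch uses Python's built-in sqlite3 module in place of MySQL 5.0, and every table name, column name, identifier, and data value is hypothetical; triggers and the full set of generated dependencies are not reproduced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
-- One unary relation per ontology class.
CREATE TABLE pattern (id TEXT PRIMARY KEY);
CREATE TABLE channel_group (id TEXT PRIMARY KEY);
-- One binary relation per ontology property.
CREATE TABLE has_roi (
    pattern_id TEXT NOT NULL REFERENCES pattern(id),
    group_id   TEXT NOT NULL REFERENCES channel_group(id)
);
CREATE TABLE ti_max (
    pattern_id TEXT NOT NULL REFERENCES pattern(id),
    latency_ms REAL NOT NULL
);
""")

conn.executemany("INSERT INTO channel_group VALUES (?)", [("LOCC",), ("ROCC",)])
facts = [("pat1", "LOCC", 250.0), ("pat2", "ROCC", 240.0),
         ("pat3", "LOCC", 110.0), ("pat4", "LOCC", 300.0)]
for pattern_id, roi, latency in facts:
    conn.execute("INSERT INTO pattern VALUES (?)", (pattern_id,))
    conn.execute("INSERT INTO has_roi VALUES (?, ?)", (pattern_id, roi))
    conn.execute("INSERT INTO ti_max VALUES (?, ?)", (pattern_id, latency))

# "Show all patterns whose region of interest is left-occipital
#  and whose peak latency is between 220 and 300ms."
query = """
SELECT p.id
FROM pattern AS p
JOIN has_roi AS r ON r.pattern_id = p.id
JOIN ti_max  AS t ON t.pattern_id = p.id
WHERE r.group_id = 'LOCC' AND t.latency_ms BETWEEN 220 AND 300
ORDER BY p.id
"""
matches = [row[0] for row in conn.execute(query)]
```

Because each property is its own binary relation, a semantic query becomes a join across the relations named in the question, with the ontology's class and property names serving as the table vocabulary.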

Progress on Specific Aim 4: Prior work on the neural bases of reading

Our targeted research domain is reading-related processes. Prior work has shown remarkable variability in reading skills among people of all ages and demographic backgrounds. This variability has important consequences for educational and socioeconomic outcomes (e.g., Perfetti 1985, 2003). One prerequisite for learning to read is the ability to map letters to sounds, i.e. orthographic decoding. This is not a “natural” ability, but one that must be acquired, and research suggests that decoding skills are developed most reliably through direct, targeted instruction (Perfetti, 1991). Recent brain imaging work has examined neurolinguistic processes that underlie learning of these critical skills, and activities that can support this learning. These studies have converged to identify one brain region, in particular, as integral to reading development, namely, the left fusiform gyrus or “visual word-form area” (VWFA). Tasks requiring orthographic decoding (e.g., nonword reading) strongly activate the VWFA in healthy, but not in dyslexic, participants (Shaywitz, et al., 2002). Further, studies suggest that VWFA may be “tuned” by phonological processing, via connections between VWFA and regions implicated in phonological processing, such as left superior temporal and inferior frontal cortex. This idea is in line with theories of reading disability that have focused on the necessity for fluent processing of sound patterns, as a prerequisite to decoding of unfamiliar visual word forms.

N170 topography varies with perceptual “expertise.” A pattern known as the N170 (or M170) has recently received a great deal of attention, as a possible analog of the VWFA effects seen in neuroimaging studies (McCandliss, et al., 2003; Maurer & McCandliss, in press). The N170 is a negative deflection in response to visual stimuli, which is maximal over occipital-temporal electrodes between ~150 and 220ms (Fig. 5). Source localization studies have identified occipito-temporal regions (including the fusiform gyrus) as the probable locus of the N170 (Maurer, et al., 2005; Frishkoff, et al., 2004; Brem, et al., 2006). Further, dense-array studies have shown that the N170 topography reliably differentiates words and pronounceable nonwords from strings of familiar symbols (Bentin et al., 1999) and from unfamiliar symbols or “false fonts” (Maurer, et al., 2005): while the N170 to words is markedly left-lateralized, the N170 to nonlinguistic symbols and pictures is bilateral or right-lateralized in healthy adults (Maurer, et al., 2005; Brem, et al., 2006).

Recent studies suggest that the N170 to words (versus false fonts) develops with age (Brem, et al., 2006) and expertise (Wong, et al., 2005). Research by members of our consortium, and others, is examining whether this specialization develops along with acquisition of reading skill and whether it transfers from familiar to novel word-like stimuli (e.g., Brem, et al., in progress). If so, this pattern could serve as a useful index of the development of orthographic decoding skills.

EM markers of phonological skill. Speech sounds evoke patterns linked to phonological processing. Two such patterns are the Mismatch Negativity (MMN) and Phonological Mismatch Negativity (PMN). The MMN is a negativity that is evoked by unexpected (“deviant”) auditory stimuli ~150-250 ms after stimulus onset; it has a frontal distribution, with polarity inversion about the mastoids. Source localization of MEG data (Alho, 1995) has suggested that the MMN may be generated in superior temporal cortex, with additional mid-… sources possibly linked to orienting of attention. Some (but not all) recent studies have reported reduced or absent MMNs in dyslexic adults and children (e.g., Lachmann, et al., 2006; but cf. Huttunen, et al., 2007). Some important, and unresolved, questions include the role of attention (and by implication, frontal generators) in the MMN response, differences (if any) in the MMN to linguistic versus nonlinguistic auditory stimuli, and the locus of phonological and/or core auditory deficits in dyslexic subjects who show a reduced MMN.

The PMN is a negativity at ~250-400ms that is evoked by phonological processing (Connolly, et al., 1995). The PMN has a scalp distribution similar to that of the MMN, though source estimations have placed generators anterior to (i.e., upstream from) primary auditory cortex (Kujala, et al., 2004). One important question is why the PMN is not consistently seen in response to visual wordforms, even with tasks that place strong demands on phonological processing (Connolly, personal communication). In Section 5, we describe two tasks that are designed to evoke PMN and N170 responses in the visual and auditory modalities. This will support the aim of mapping reading-related patterns across modalities. These studies will initially be conducted with healthy adults and later extended to include healthy and reading-impaired adults and children (ages 5 and above).

Progress on Specific Aim 5: Ontology Mapping, Merging and Data Integration

In work leading up to this proposal, PI Dou developed Web-PDDL (McDermott & Dou, 2002), a strongly typed
first-order language for representing ontologies and the mappings between them. Because different people,
laboratories or organizations may have distinct interpretations of similar concepts, one of the most difficult
problems in information sharing concerns the semantic differences between ontologies. The problem
involves much more than a choice of words. We can expect these problems to be exacerbated in the
neuroinformatics domain, which requires just the sort of highly expressive knowledge representation and
mappings that Web-PDDL was designed to support. Although we have adopted the Web Ontology Language
(OWL) to define our first-generation ERP ontologies, it remains an open question which language will become the
standard for semantic (ontology) mappings; whichever is chosen should be a subset of first-order logic. Until the
proposed standards for semantic mappings (Stuckenschmidt & Uschold, 2005) are finalized, we will use Web-PDDL
to describe the mappings. We can also use SWRL or RIF (http://www.w3.org/2005/rules/) to represent
rule-like mappings, but there is currently no SWRL or RIF inference engine that can directly support ontology
merging and data integration (cf. Dou et al., 2006). In the future, we envision that Web-PDDL will function
mainly as an internal language with wrappers supporting future standards (e.g., RIF).

With Web-PDDL as an internal language for describing mapping rules between ontologies, we use
OntoEngine (Dou et al., 2005) to implement data translation and query translation across data resources:



Figure 5. ERP response to a visually presented word (~140-240 ms). Maps shown at
~20 ms intervals. Green = negative; yellow = positive. Data from Frishkoff, et al. (2004).
Panel labels: Early N1, Late N1/N2.

Dataset translation: Translating data from a source ontology to a target ontology for the purpose of total
information exchange. This task was implemented mainly by forward chaining on source data facts plus
mapping rules, retaining only conclusions that are facts expressed in the target ontology.

Query translation: Extracting data expressed in one ontology to answer a query posed in
another ontology. This task is implemented by backward chaining on the query plus mapping rules from the source
ontology, reformulating the query in the terms of the target ontology.
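
As a minimal illustration of these two modes (this is not Web-PDDL or OntoEngine code), the Python sketch below treats facts as triples and mapping rules as one-to-one predicate rewrites. The predicate names, the "lab:" and "shared:" prefixes, and the function names are hypothetical and chosen only for this example. Real mapping rules are far more expressive than a lookup table.

```python
# Toy sketch only. Facts are (predicate, subject, value) triples, and each
# mapping rule says "a fact with the left predicate implies a fact with the
# right predicate". All names here are hypothetical.
MAPPING_RULES = {
    "lab:peak_latency_ms": "shared:latency_ms",
    "lab:scalp_region": "shared:topography",
}

def translate_dataset(source_facts, rules):
    """Forward chaining: fire rules on the data facts and keep only the
    conclusions expressed in the other ontology."""
    return {(rules[p], s, v) for (p, s, v) in source_facts if p in rules}

def translate_query(query, rules):
    """Backward chaining: replace the goal by the premises of every rule
    whose conclusion matches the queried predicate."""
    predicate, subject, value = query
    return [(src, subject, value) for src, tgt in rules.items() if tgt == predicate]

def answer(query, facts):
    """Return facts matching the query; None in the query is a wildcard."""
    return [f for f in facts if all(q is None or q == x for q, x in zip(query, f))]

lab_facts = [
    ("lab:peak_latency_ms", "pattern1", 170),
    ("lab:scalp_region", "pattern1", "occipito-temporal"),
    ("lab:recording_note", "pattern1", "no mapping rule applies"),
]

# Dataset translation: move the data into the shared vocabulary.
shared_facts = translate_dataset(lab_facts, MAPPING_RULES)

# Query translation: leave the data in place and rewrite the query instead.
shared_query = ("shared:latency_ms", None, None)
lab_queries = translate_query(shared_query, MAPPING_RULES)
hits = [fact for q in lab_queries for fact in answer(q, lab_facts)]
```

In this sketch the unmapped "lab:recording_note" fact is dropped by dataset translation, which mirrors the rule of retaining only conclusions expressed in the other ontology.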

In earlier work, we developed the OntoMerge system to perform query answering and data translation across
Semantic Web ontologies (Dou et al., 2003; Dou et al., 2005). More recently, we have developed extensions for
database applications and Semantic Web queries in e-Commerce (Dou & LePendu, 2006; Dou, et al., 2006).

Progress on Specific Aim 6: Neuroinformatics Infrastructure to support NEMO research collaboration

The NEMO project will be supported by an integrated environment of tools for storage and analysis of ERP
data, plus web-based technologies for consortium collaboration. Our expertise in high-performance computing
and distributed systems will be important in creating the necessary infrastructure and tailoring it to the needs of
our partner labs. Several projects demonstrate the research background we bring to neuroinformatics
problems related to the NEMO objectives. In addition, we describe our recent work on the NEMO portal.

Domain-specific environments

The success of environments for collaborative research depends on how well the requirements of the scientific
domain are supported by environment tools, thereby reducing the complexity of problem solving activities for
users and increasing productivity. For example, researchers who are not experts in computing or database
systems may need high-level support for managing data, specifying analysis workflows, conducting meta-analyses,
and visualizing results. Our work on the Virtual Notebook Environment (ViNE) (Malony, et al.,
2000) used the metaphor of a laboratory notebook to provide a high-level interface for scientists to link their
tools and data within a web-based system. ViNE supports the distributed execution of computational processes
by invoking tools on computational servers and accessing and moving data as required. Figure 6 shows the
ViNE architecture and a sample notebook for EM experiment management and data analysis.

Figure 6. ViNE Architecture and EEG Analysis Notebook Example

High-performance analysis of dense-array EM data

Of course, robust tools for data analysis and evaluation are central to building a neuroinformatics environment
of high utility. Signal and statistical analysis of dense-array EM data extracts features that can be used to
understand dynamics of brain function. For example, component analysis techniques applied to multi-series
data are capable of decomposing signals into separable significant constituents. Component
decomposition can aid in cleaning and artifact detection in addition to identifying components of brain
operation. PCA (principal components analysis) and ICA (independent components analysis) are two leading
techniques in the field. Our previous research has made use of a variety of statistical software solutions for
processing dense-array EM data, including blind source separation methods (Frank & Frishkoff, 2007).
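
To make the idea of component decomposition concrete, the sketch below extracts the leading principal component of a synthetic two-channel signal using power iteration on the sample covariance matrix. It is an illustration under stated assumptions, not the software used in our analyses: the data, the mixing weights, and the function name are made up, and only PCA is shown (ICA requires additional machinery).

```python
import math

def first_principal_component(samples, iterations=100):
    """Return (unit direction, variance) of the leading principal component.

    samples: list of rows, one row per time sample, one column per channel.
    Uses power iteration on the sample covariance matrix.
    """
    n, d = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in samples]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    v = [1.0 / math.sqrt(d)] * d          # arbitrary unit-length start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Variance explained along v is the quadratic form v^T C v.
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, variance

# Synthetic recording (made-up numbers): a strong source seen by the two
# channels with weights (2, 1), plus a weak artifact with weights (1, -2).
samples = []
for t in range(500):
    source = math.sin(2 * math.pi * t / 50)
    artifact = 0.1 * math.cos(2 * math.pi * t / 7)
    samples.append([2 * source + artifact, source - 2 * artifact])

direction, variance = first_principal_component(samples)
```

Because the source dominates the variance, the recovered direction is close to the source's mixing weights (2, 1) after normalization, while the weak artifact falls into the second component.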


The Automated Protocol for Electromagnetic Component Separation (APECS) is a framework for
evaluation of methods for ERP data cleaning and component analysis (Frank & Frishkoff, 2007;
Glass, et al., 2004).
It was developed to compare alternative component analysis techniques by
applying quantitative and qualitative criteria to evaluate the success of data decomposition (e.g., to ensure
artifacts are separated from cortical activity). APECS addresses key issues in EM data analysis, such as
superposition of components and separation of brain activity ("signal") from noncephalic artifacts ("noise"),
which can be included as preprocessing steps in the creation of EM ontologies.

High-performance analysis libraries

ERP analysis tools, such as EEGLAB (Delorme & Makeig, 2004) and APECS (Frank & Frishkoff, 2007), utilize
various procedures to process EEG data. Depending on data size and analysis complexity, these procedures
can be computationally and memory intensive. A good example is the execution requirements of the
Infomax (Bell & Sejnowski, 1995) algorithm, which
grows in time complexity as O(n
) in the number of channels, and
O(n) in the number of time samples. For EEG data with 256 channels and several hundred time samples,
sequential implementations can be prohibitively slow. Infomax procedures in EEGLAB performed poorly
with large datasets because of single-threaded execution and memory management.

To support the computational and memory demands of high-resolution EM neuroimaging, we developed a
high-performance, parallel C++ library called HiPerSAT for dense-array ERP signal analysis (Keith, et al., 2006).
Although designed for a wide range of signal analysis capabilities, the initial implementation of the HiPerSAT
library consists of ICA routines based on the Infomax and FastICA (Hyvärinen, 1999) methods previously
implemented in Matlab. Compared to the sequential implementation of these algorithms (in Matlab and C++),
the parallel implementations achieve significant speedup (Keith, et al., 2006). This improved performance will
be critical as we apply these methods to analysis of large-scale EEG datasets. Parallel SOBI and wavelet
methods have also been implemented.
HiPerSAT parallel computation is currently being adapted to use the open source IT++ library
for signal processing, and we intend to contribute our HiPerSAT code to the IT++ community.

Distributed computing with Matlab

Building high-performance libraries like HiPerSAT can alleviate Matlab performance and memory issues,
but Matlab must be able to interact with external parallel systems to take advantage of them. Malony and
colleagues have developed a package for Matlab called Mcore (Matlab Concurrent Runtime Engine) that
allows it to run concurrent tasks in a distributed computing system (Hoge, et al., 2006).
Given such a computing system, Mcore with APECS can
run simultaneous ICA tasks externally on multiple HiPerSAT servers and
concurrently with other APECS operations. We have demonstrated APECS with 20 simultaneous HiPerSAT
servers running parallel ICA tasks. Compared to sequential ICA processing of 20 one-GB EEG files in
EEGLAB, we see a 5x speedup for HiPerSAT on one data set, and a 10x speedup from running 20 HiPerSAT
tasks together, for a 50x total reduction in processing time. We have also demonstrated APECS clients running
simultaneously on multiple workstations and accessing the HiPerSAT servers on different parallel platforms.
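
The 50x figure is the product of the two reported factors. The short calculation below restates that arithmetic and also derives the implied concurrency efficiency (a 10x gain from 20 simultaneous tasks). The function names are illustrative, and the efficiency value is our own derived ratio rather than a separately measured result.

```python
def combined_speedup(per_task_speedup, concurrency_speedup):
    """Independent speedup factors multiply."""
    return per_task_speedup * concurrency_speedup

def concurrency_efficiency(concurrency_speedup, n_tasks):
    """Fraction of ideal linear scaling achieved by running n_tasks together."""
    return concurrency_speedup / n_tasks

# Ratios taken from the text: 5x per data set, 10x from 20 concurrent tasks.
total = combined_speedup(5, 10)
efficiency = concurrency_efficiency(10, 20)
print(total, efficiency)  # 50 0.5
```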

Portals for scientific collaboration

During the past two years, the NIC has been exploring web portals as a means to provide shared access to
ERP data. The APECS portal was developed to enable scientists to share EEG data, jointly run
data analysis tasks, and view the component results (Hanson-Smith, 2007). The system
components of the APECS portal are shown in Figure 7. Collaboration is supported through file sharing, tool
access, and project reports. Scientists can upload files through the portal and use those files in analysis
workflows. The portal recognizes diverse file types, including EEG data, blink templates, and contextual metadata.
A terabyte storage system is integrated in the portal design to save both raw and processed EEG
waveforms. The APECS portal is a user-centric environment using portlet and web
technologies, with tools for scientists to jointly compose, execute, and manage signal analysis projects.

Figure 7. Prototype APECS Portal Architecture and Job Creation Interface

Based on the experience gained in the APECS portal development, we have created a prototype NEMO portal
that gives our consortium partners access to ERP data, metadata, and analysis metrics from the
pilot language study. The NEMO portal demonstrates the use
of open source software, including a content management framework and management system.

This preliminary work provides an excellent foundation for extending the functionality
needed to meet the collaboration, data sharing, and analysis requirements of the full NEMO project.
In addition, our partnership with the HeadIT project (assuming it is funded) will provide BIRN
node resources to integrate the NEMO portal in a manner fully compatible with BIRN