KNOWLEDGE INTEGRATION & VISUALISATION




26th Annual Symposium on Chemometrics

Vondelparc, Utrecht

May 20th, 2010


KNOWLEDGE INTEGRATION & VISUALISATION


Program

9.30   Registration, coffee
10.15  Renger Jellema (DSM): Opening
10.30  Patricia Menéndez (Biometris): Penalized regression
11.00  Coffee/tea break
11.30  Huub Hoefsloot (UvA): Biological networks
12.00  Johan de Rooi (Erasmus MC): Sparse network estimation
12.30  Walking lunch / Poster session
13.45  Martin Theus (Telefónica O2, Germany): Plot and Look! Trust your data more than your models
14.30  Henk Kiers (Groningen): Visualizing results of two-way and three-way data analysis
15.00  Coffee/tea break
15.30  Anthony La Grange (Stellenbosch University, South Africa): Interactive biplots in R
16.00  Klaas Faber (2CY): The Biological Passport from a chemometrics’ perspective
16.30  Winning Poster Announcement
17.45  Closure and drinks



Patricia Menéndez: Penalized regression techniques for modeling relationships
between metabolites and tomato taste attributes

Biometris, Wageningen University


Patricia Menéndez is a postdoc at Biometris, Wageningen. She obtained her Ph.D. at the
Seminar for Statistics, ETH Zurich.


Finding models that link tomato taste attributes to metabolic profiles is a main
challenge within breeding programs that aim to enhance tomato flavor. Within the
second food quality project organized by the Center for BioSystems Genomics (CBSG),
we investigated the relationships between metabolic data and tomato taste attributes.
In particular, we compared models obtained by the traditional statistical approach,
stepwise regression, with those computed by a newer generation of regression
techniques known as penalized regression or regularization methods.

In addition, for penalized regression, different scenarios and various model
selection criteria were examined. We conclude that classical cross-validation selects
models with many superfluous variables, whereas model selection criteria such as the
Bayesian information criterion (BIC) seem more suitable when the goal is to find
parsimonious models that explain tomato taste attributes from metabolic information.
An exhaustive comparison of the methods discussed will be presented for a number of
sensory traits. It shows that the most important covariates were identified by
stepwise regression as well as by the penalized regression methods, despite general
disagreement between them on the size of the regression coefficients. In particular,
the stepwise regression coefficients are inflated because of their high variance,
which is not the case with penalized regression. Penalized regression can therefore
be an alternative for obtaining more accurate models.
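
As a rough illustration of the kind of model selection described above, the sketch below fits a lasso (l1-penalized) regression by coordinate descent over a small grid of penalty values and keeps the fit with the lowest BIC. Everything in it is an assumption made for illustration: the data are simulated, the names `lasso_cd` and `bic` are hypothetical, the penalty grid is arbitrary, and the BIC form n·log(RSS/n) + k·log(n), with k the number of non-zero coefficients, is one common convention and not necessarily the one used in the study.

```python
import math
import random


def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)*RSS + lam*sum|b_j| by cyclic coordinate descent.

    X is a list of rows with centred columns; y is centred.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    resid = list(y)  # residuals for the starting point b = 0
    for _ in range(n_iter):
        for j in range(p):
            # covariance of column j with the partial residual (b_j removed)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_ss[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                b[j] = new_bj
    return b


def bic(X, y, b):
    """BIC = n*log(RSS/n) + k*log(n), k = number of non-zero coefficients."""
    n = len(y)
    rss = sum((y[i] - sum(X[i][j] * b[j] for j in range(len(b)))) ** 2
              for i in range(n))
    k = sum(1 for bj in b if bj != 0.0)
    return n * math.log(rss / n) + k * math.log(n)


def centre(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


# Simulated stand-in for "metabolites" (X) and one "taste attribute" (y):
# only the first two of eight predictors truly matter.
random.seed(0)
n, p = 60, 8
true_b = [2.0, -1.5] + [0.0] * (p - 2)
raw = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
cols = [centre([row[j] for row in raw]) for j in range(p)]
X = [[cols[j][i] for j in range(p)] for i in range(n)]
y = centre([sum(X[i][j] * true_b[j] for j in range(p)) + random.gauss(0, 0.5)
            for i in range(n)])

lambdas = [0.5, 0.2, 0.1, 0.05, 0.02, 0.01]
fits = [(lam, lasso_cd(X, y, lam)) for lam in lambdas]
best_lam, best_b = min(fits, key=lambda f: bic(X, y, f[1]))
selected = [j for j, bj in enumerate(best_b) if bj != 0.0]
print("lambda chosen by BIC:", best_lam)
print("selected variables:", selected)
```

Replacing the `bic` criterion by a cross-validated prediction error in the `min(...)` call would give the other selection rule mentioned in the abstract, so the two can be compared on the same penalty grid.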





Huub Hoefsloot: Biological networks: from graph theory to regression

Biosystems Data Analysis Group, UvA

Huub Hoefsloot holds an MSc in Mathematics (1987) and a PhD in Chemical Engineering
/ Numerical Mathematics (1992). In 1991, he started as an assistant professor at the
Universiteit van Amsterdam; his main research area is building and validating
(dynamical) models of complex processes. In 2000 he became associate professor and
in 2001 he joined the Biosystems Data Analysis group.

Various biological networks and their interactions will be introduced. Are chemometric
tools suited to analyzing data that result from these complex networks? In my opinion
the answer to this question is no. So how can we improve and modify standard tools in
order to analyze biological problems and create output that is useful for the
biologist?




Johan de Rooi: Sparse network estimation from gene expression data

Department of Bioinformatics, Erasmus Medical Centre

Johan de Rooi is a PhD student at Erasmus Medical Center in Rotterdam. He obtained a
Master's degree in Statistics and Methodology at Utrecht University.

Recently, various approaches to recovering gene regulatory networks from expression
data have been proposed. Commonly applied methods are Support Vector Machines,
Bayesian networks and methods based on information theory. From a statistical point of
view it is logical to use the covariance matrix of the genes to build the network.
More elegant is to use its inverse (the precision matrix), because of its close
relation to partial correlations. Although the final model should be sparse, the
covariance matrix does not contain any zeros. Moreover, because of the large number of
variables and the limited number of samples, the inverse often cannot be calculated.
To derive an invertible covariance matrix and reach a sparse model, shrinkage
procedures are applied. The approach we take fits a regression model for one variable
with all other variables as predictors. This procedure is repeated for all variables.
There is a direct link between this model and the inverse covariance matrix: a
non-zero regression coefficient corresponds to a non-zero element of the precision
matrix. From the regression coefficients we can calculate the partial correlations.
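
For reference, the standard population-level relations behind this link can be written out as follows, assuming centred variables x_1, ..., x_p with covariance matrix Σ and precision matrix Ω = Σ^{-1} with elements ω_ij. The notation is chosen here for illustration and does not come from the abstract.

```latex
\begin{align*}
x_i &= \sum_{j \neq i} \beta_{ij}\, x_j + \varepsilon_i,
  & \beta_{ij} &= -\frac{\omega_{ij}}{\omega_{ii}}, \\
\rho_{ij \mid \text{rest}} &= -\frac{\omega_{ij}}{\sqrt{\omega_{ii}\,\omega_{jj}}}
  = \operatorname{sign}(\beta_{ij}) \sqrt{\beta_{ij}\,\beta_{ji}} .
\end{align*}
```

So β_ij is zero exactly when ω_ij is zero. With penalized estimates the two coefficients β_ij and β_ji need not agree in sign or both be non-zero, in which case the square-root formula cannot be applied directly and some symmetrization rule is needed.
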
Probably the most often used penalty is the l1 penalty, also known as the lasso. This
penalty is attractive because it does both shrinkage and variable selection by setting
coefficients to zero. However, in the context of recovering sparse networks the l1
penalty often leaves too many edges in the network. To further reduce the number of
nodes with many relations, we use the l0 penalty, which yields a model that better
resembles the very sparse nature of genetic networks. Because the l0 penalty is
non-convex, we adopt a two-step strategy. In the first step, the l1 penalty is applied
to compute an initial solution. In the second step, the l0 penalty is put to work on
the remaining non-zero coefficients in the model. Preliminary results on both real
data and small simulations are promising.
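
The first, l1-penalized step of the node-wise regression described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the speaker's implementation: the four "genes" are simulated as a chain g0, g1, g2 plus one unconnected gene g3, the names `lasso_cd` and `nodewise_lasso` are hypothetical, the penalty value 0.1 is arbitrary, and an edge is kept only when both regressions give a non-zero coefficient (an "AND" rule chosen here for simplicity). The l0 refinement step is not shown, since the abstract does not describe how it is computed.

```python
import random


def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)*RSS + lam*sum|b_j| by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    resid = list(y)
    for _ in range(n_iter):
        for j in range(p):
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_ss[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                b[j] = new_bj
    return b


def nodewise_lasso(data, lam):
    """Regress every variable on all others; B[i][j] is the coefficient of
    variable j in the regression for variable i."""
    p = len(data[0])
    B = {}
    for i in range(p):
        others = [j for j in range(p) if j != i]
        X = [[row[j] for j in others] for row in data]
        y = [row[i] for row in data]
        B[i] = dict(zip(others, lasso_cd(X, y, lam)))
    return B


# Simulated expression data: chain g0 -> g1 -> g2, and g3 unconnected.
random.seed(1)
n = 200
data = []
for _ in range(n):
    g0 = random.gauss(0, 1)
    g1 = 0.8 * g0 + random.gauss(0, 1)
    g2 = 0.8 * g1 + random.gauss(0, 1)
    g3 = random.gauss(0, 1)
    data.append([g0, g1, g2, g3])

# Centre each variable so that no intercept is needed.
means = [sum(row[j] for row in data) / n for j in range(4)]
data = [[row[j] - means[j] for j in range(4)] for row in data]

B = nodewise_lasso(data, lam=0.1)
edges = sorted((i, j) for i in B for j in B[i]
               if i < j and B[i][j] != 0.0 and B[j][i] != 0.0)
print("edges kept by the l1 step:", edges)
```

Any spurious edges left in `edges` are the kind of excess that the second, l0-penalized step in the abstract is meant to remove.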




Martin Theus: Plot and Look! Trust your data more than your models

Telefónica O2 Germany


Martin Theus is Senior Project Manager in the Analysis Center of the Business
Intelligence Unit of Telefónica O2 Germany. His research and application areas are
data visualization and data mining as well as exploratory data analysis. He is the
author of the data analysis software Mondrian and co-author of the book "Graphics of
Large Datasets".


Statistics certainly has its main purpose in analyzing data. While mathematical
statistical methods became the predominant toolbox for statistics in the last century,
there are now other powerful approaches that complement these techniques. John
W. Tukey's groundbreaking work on Exploratory Data Analysis led to the field of
Interactive Statistical Graphics. Although research in interactive graphics dates back
more than 30 years, the underlying techniques and strategies are still not widespread.
The reason for this is not a lack of tools (Mondrian, iplots and ggobi are readily
available) but a lack of teaching of these techniques. Interactive Statistical
Graphics works directly on the raw data itself and does not impose any model
assumptions. It supports exploration, which is the necessary basis for further
confirmatory steps.

This presentation will give an overview of the most important techniques and
strategies of Interactive Statistical Graphics and illustrate them with Mondrian.

Mondrian is a general purpose graphical data analysis tool, which offers a wide range
of graphical displays in a highly linked and interactive environment.


[1] Unwin, A., Theus, M., Hofmann, H., Graphics of Large Datasets: Visualizing a
Million, Springer

[2] Theus, M., Urbanek, S., Interactive Graphics for Data Analysis: Principles and
Examples, CRC Press




Henk Kiers: Visualizing results of two-way and three-way data analysis

University of Groningen


Henk Kiers is professor in Methods for Data Analysis within the Department of
Psychology at the University of Groningen. His main areas of research have been, and
to various extents still are, multiway and multiset component analysis, least squares
based optimization procedures, simple structure rotation, analysis of asymmetric
relationships data, and the use and usability of statistical techniques. He has given
courses on multiway component analysis and least squares optimization.


Popular two- and three-way analysis techniques (PCA, Parafac, Tucker analysis)
basically come down to projecting the high-dimensional data space onto low-dimensional
subspaces. In plotting results from such techniques it is therefore logical, but not
always straightforward, to take this into account. In the presentation I will describe
how three-way results should and can be plotted.
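
For the two-way (PCA) case, the projection view can be made explicit as below. This is a sketch in notation chosen here, not taken from the talk: X is assumed to be a column-centred I×J data matrix, P the J×R matrix of orthonormal loadings of the first R components, T the I×R score matrix, and t_i and x̂_i the i-th rows of T and X̂.

```latex
\begin{align*}
T &= X P, \qquad P^{\top} P = I_R, \\
\hat{X} &= T P^{\top} = X P P^{\top}, \\
\lVert t_i - t_k \rVert &= \lVert \hat{x}_i - \hat{x}_k \rVert .
\end{align*}
```

The last line says that distances between points in a score plot equal distances between the projected observations, which holds on paper only if both axes are drawn with the same unit length. The three-way plots discussed in the presentation are not covered by this sketch.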



Anthony La Grange: Visualising multivariate data using biplots in R

Erasmus University, Rotterdam


Anthony la Grange is a PhD student at Erasmus University, Rotterdam.

He was previously a student at Stellenbosch University, South Africa.


[1] http://biplotgui.r-forge.r-project.org/




Klaas Faber:
The Biological Passport from a chemometrics’ perspective

2CY


Biological passport: promising detection technique or legally shaky?


In the C2W issue of 23 January last, Emma van Laar gave an excellent account of the
various positions on the biological passport, and the following two paragraphs are
therefore taken more or less verbatim from her article ‘Paspoort op glad ijs?’
(‘Passport on thin ice?’). Incidentally, the title of the presentation is borrowed
from an intermediate version of that article!

The biological passport is a database in which an athlete's medical data are recorded
over an extended period. It contains the results of all doping tests (blood and urine
samples) that an athlete undergoes and thus provides insight into blood values (such
as hemoglobin levels or red blood cell counts) and hormone balance (breakdown products
of steroid hormones). Unusual deviations could indicate doping. Where a positive
doping test used to be required, since 1 January 2009 an athlete can also be suspended
on the basis of the biological passport alone. This regulation has caused much debate.


According to the speaker, crucial errors are made in prosecutions based on the
biological passport. The same kind of errors were made earlier in major legal cases,
such as those of Lucia de B. and Sally Clark [1,2]. Not only are the statistics
misused, but logical reasoning also leads to the conclusion that the passport is not
sufficient for prosecution. For instance, no labels can be attached to the measured
values; in other words, you do not know where a deviation comes from. The numbers are
not conclusive, because all sorts of things can play a role. Prosecuting on the basis
of this flawed evidence is therefore a bridge too far.

In short: a dissenting position, and one that is far from inconsequential in concrete
cases such as that of Claudia Pechstein, in which the speaker incidentally acts as an
adviser. According to the speaker, Claudia Pechstein is in effect the Lucia de B. or
Sally Clark of sport.


[1] M. Buchanan, Conviction by numbers, Nature, 445 (2007) 254-255.

[2] http://www.sportknowhowxl.nl/index.php?pageid=detail&catid=OpenPodium&cntid=4366


