D2.2 Data analysis and methods


Project no. FP6 - 043345

Project acronym CID




Priority 7

D2.2 Data analysis and methods

Period covered: from 1-1-2009 to 30-06-2009
Date of preparation: December

Start date of project: 1-1-2007
Duration: 36

Project coordinator name: Prof. Stefano Brusoni
Project coordinator organisation name: CESPRI
Final Deliverable


Data analysis tools and their application to
social sciences

Institute for Applied Biotechnology and Systemanalysis (IBiS) at the University of Witten/Herdecke

• Introduction

• A short Detour about Theory

• Tools and Software

• Data Mining, Machine Learning and more

• Text Mining: a Special Case

• Some Words about Modelling

• Application to Social Sciences

• References



Modern “system analysis” rests on three pillars:

• computers
• software
• datasets representing the system.

We do not want to discuss the computer part in detail, even though it is important: speed and storage capacity still determine the extent and complexity of what can be analysed. For the cases we consider here, good modern multi-processor machines are sufficient to answer most of the questions posed under the discussed circumstances.
Software – Because of the knowledge of data and information theory worked out during the twentieth century in mathematics, statistics and computer science, an enormous theoretical background for system analysis now exists. On this basis, modern stochastic, statistical, numeric and logic techniques were developed to extract as much information as possible, or let us call it knowledge, from data sets in order to unravel complexity. From these validated techniques, software tools were developed, which may take the form of libraries, programming environments, specialised programs etc. Later we will discuss these toolboxes in more detail.
Data sets – The aim of analysing a system or subsystem is to be able to describe it and understand its behaviour. One therefore needs to collect as much information as possible, with the highest available precision, about the system or subsystem. Already in this short description of the aim we see an inherent tension in the attempt to understand systems. Which information do we want? All we can get? But then we lose precision. Or do we want only the part of the information which might be more “precise”? Then we lose information, because we need to filter the data. And if we acquire new data, for which purpose is it: to gain more information or to increase precision? If we reduce the information set, we may lose important parameters describing the system, its structure and its behaviour. Do we then lose our understanding of the system? Do we even reach the point of misunderstanding it? If, aside from the question of information versus precision, we also ask how and by how much to reduce complexity (to what extent, in which direction?), the answers become more and more difficult.
The theory of system analysis tries to approach these problems, but one will never arrive at a single definite answer; one gets multiple answers. We have to get used to this when trying to understand complexity: the higher the complexity, the higher the dimensionality, and the higher the number of possible answers with similar probability and precision. But let us discuss how one can deal with high complexity.
Every system “a” is a subsystem of a higher system “A”. We therefore first have to give a description of the system “a” and its dependencies on the higher system “A”, as far as we know them. There will still be a subset of unknown parameters and relations which we hope are not relevant, at least for describing the part of the system “a” we want to understand with a precision “pa” (but we always keep this subset in mind as well).

Figure 1 Core-System X

Then we try to reduce the system “a” to the less complex subsystem “a′” and again describe the new dependencies on “a”. In the next step one reduces to “a′′”, and so on, until, as shown in Figure 1, one arrives at a “Core-System X” which is as close to the target questions as possible and is best represented by the original data set one has.
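The reduction chain from “a” down to a Core-System X can be sketched in code. As an assumption on our part (the text names no concrete relevance criterion), the sketch below drops the least variable attribute at each step; any other relevance measure could be plugged in instead.

```python
# Illustrative sketch of the reduction chain a -> a' -> a'' -> X.
# Using variance as the relevance criterion is our assumption.
from statistics import pvariance

def reduce_system(data, keep):
    """data: dict mapping attribute name -> list of observations.
    Repeatedly drop the attribute with the lowest variance until only
    `keep` attributes (the "core system") remain; return the chain of
    intermediate systems."""
    current = dict(data)
    chain = [current]
    while len(current) > keep:
        weakest = min(current, key=lambda k: pvariance(current[k]))
        current = {k: v for k, v in current.items() if k != weakest}
        chain.append(current)
    return chain

system_a = {
    "x1": [1, 2, 3, 4],    # informative
    "x2": [5, 5, 5, 5],    # constant, dropped first
    "x3": [0, 10, 0, 10],  # informative
}
chain = reduce_system(system_a, keep=2)
print([sorted(step) for step in chain])
```

Each element of the returned chain corresponds to one reduced subsystem, so the dependencies between consecutive steps can be re-examined after every reduction, as described above.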

What becomes obvious now is that there is not just one technique, algorithm or program which can do this. Each step has to be optimised and the tools have to be composed in the right way to reach a reliably converging solution. Even then, several different ways are possible for each step. Figure 2 is copied from Wikipedia and gives a good overview of the topics of machine learning, which is one of the techniques used in our analysis.

In reality there is never “the solution”. It is like climbing one of the highest mountains in the Himalaya: every mountaineer has his own route, depending on his individuality and his abilities, but also on equipment and weather.
The way we climb our system is the holistic way. We first look at the full scene, then reduce the “mountain” to its “best part” by analysing the different areas of it leading to the top. On this learnt basis we climb the first part of it and analyse again, to verify the decision or to take another route.
To make clearer how we proceed, we first need a little detour into theory. Then we describe the tools and the software which produce the results for the description of the system, leading later to a first approximate model.


Machine learning topics
This list represents the topics covered in a typical machine learning course.

Bayesian theory

Modelling conditional probability density functions and classification
• Artificial neural networks
• Decision trees
• Gene expression programming
• Genetic algorithms
• Genetic programming
• Inductive logic programming
• Gaussian process regression
• Linear discriminant analysis
• K-nearest neighbour
• Minimum message length
• Perceptron
• Quadratic classifier
• Radial basis function networks
• Support vector machines

Algorithms for estimating model parameters
• Dynamic programming
• Expectation-maximization algorithm

Modelling probability density functions via generative models
• Graphical models, including Bayesian networks and Markov random fields
• Generative topographic mapping

Approximate inference techniques
• Monte Carlo methods
• Variational Bayes
• Variable-order Markov models
• Variable-order Bayesian networks
• Loopy belief propagation
• Most of the methods listed above either use optimization or are instances of optimization algorithms

Meta-learning (ensemble methods)
• Boosting
• Bootstrap aggregating
• Random forest
• Weighted majority algorithm

Inductive transfer and learning to learn
• Inductive transfer
• Reinforcement learning
• Temporal difference learning
• Monte Carlo method

Figure 2


A short Detour about Theory

The software tools used are implementations of statistical laws and mathematical algorithms. Beside the standard statistical methods, Bayesian statistics is the basis of most algorithms in data mining (text mining), information retrieval, machine learning and all the other fields of intelligent automatic data analysis. Elementary probability theory deals with probability spaces, probability functions, probability distributions, conditional probability, independence and so on; these are all expressions of orthodox frequentist statistics. At this point I do not want to go too deeply into mathematical formulations. Readers interested in a deeper understanding of the statistical basis will find reasonable explanations in the standard statistical literature.
Here I want to give a little example to introduce and explain the aim of Bayesian statistics. Imagine you toss a coin ten times and get eight heads and two tails. The outcome, eight heads out of ten, is called the maximum likelihood estimate. Contrary to your expectation, you did not get five heads out of ten; this would only be correct if the coin were perfectly weighted and you always threw it the same way. Because you are surprised by the result, you make a series of further trials. You toss the coin 20 times with an outcome a > 10 out of 20, 40 times with a result b > 20 out of 40, 60 times (c > 30 out of 60) and so on. The description of such an experiment and its statistical outcome can be handled by Bayesian statistics. You might have thought that the deviation from your expectation comes from the small number of samples; in other words, you have a prior belief that influences your belief even in the face of apparent evidence against it. Bayesian statistics measures degrees of belief, calculated by starting with prior beliefs and updating them in the face of evidence by use of Bayes’ theorem. Because of this, it is the basis of many algorithms which can evaluate the probability of a model against a variety of alternative hypotheses. Therefore, in all of the tools and software we describe later, Bayesian statistics is one of the foundations. For further reading, a short, more mathematical introduction to Bayesian statistics is given in the book by Manning and Schütze, “Foundations of Statistical Natural Language Processing”, which describes Bayesian statistics beyond the aspects of language processing alone.
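The coin example above can be sketched numerically. As an assumption (the text does not fix a concrete prior), we use the conjugate Beta(a, b) prior on the head probability: observing h heads in n tosses updates it to Beta(a + h, b + n − h), so the belief shifts smoothly from the prior towards the observed frequency as evidence accumulates.

```python
# Minimal sketch of Bayesian updating for the coin example.
# The Beta prior is our (conjugate) assumption; the later head
# counts are illustrative, since the text only gives bounds.

def update(a, b, heads, tosses):
    """Return the posterior Beta parameters after the observations."""
    return a + heads, b + (tosses - heads)

def posterior_mean(a, b):
    """Expected head probability under Beta(a, b)."""
    return a / (a + b)

# Start from a uniform prior Beta(1, 1) and feed in a series of
# experiments: 8/10 heads, then 10/20, then 20/40.
a, b = 1.0, 1.0
for heads, tosses in [(8, 10), (10, 20), (20, 40)]:
    a, b = update(a, b, heads, tosses)
    print(f"after {tosses:2d} more tosses: posterior mean = {posterior_mean(a, b):.3f}")
```

After the surprising first run the posterior mean sits at 0.75; the later, more balanced runs pull it back towards 0.5, which is exactly the interplay of prior belief and evidence described above.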


Tools and Software

In the CID project, many different tasks deal with the question of cultural influences on innovation diffusion. We have historical topics as well as medical or commercial ones. The aim of IBiS is to find laws or rules within each of these topics, or even between them. The task for IBiS is therefore divided into four different sub-tasks:

1. Import and convert the different kinds of data sets from the different sources

2. Build up consistent databases

3. Pre-analyse the data for classification and obtain preliminary rule sets

4. Try to find rules and laws for modelling.

Importing and converting

As shown in Figure 3, the data handling for modelling is quite complex. Especially in the social sciences, but not only there, we very often have to deal with different sources of data. The data come from different questionnaires, from other databases, from the literature, and so on. Sometimes the data are extracted from the web or exist as web links. For extensive modelling we need all the data we can get about the system to be analysed, because the information content of each data source, even if the sources have different forms, helps to understand the structure and behaviour of the system. But if one leaves the data in their different forms, one cannot compare them, at least not with machine learning and other computational techniques. The data have to be made comparable. Especially if one wants to include information from textual data in the analysis, one needs to extract the information out of the text and transform it into a homogeneous form comparable with the other data.


Figure 3 Data handling for consistent Modelling

The easiest datasets to handle are original statistical data prepared for SPSS in the SAV format (name.sav). We chose this format and developed a standard converter because each questionnaire in the CID project is analysed statistically via SPSS; most data from public or other databases can also be obtained in this format.
More complicated is the situation for textual data. Even the most commonly used text formats carry too much additional information which has nothing to do with the pure content of the text (such as paragraph formatting, letter format, page format etc.). For the standard application it was therefore advisable to use a plain text format; we chose the TXT format (name.txt) as the basis for importing text information. We will discuss this in more detail later, when we describe text mining.
In both cases, the more formal data such as statistical data as well as the textual data are all converted into a form which can be handled by the toolkits for data mining, text mining, information retrieval and all the other intelligent software we use for analysing complex systems.
[Figure 3 diagram: the original datasets, a statistical dataset (-.sav) and a textual dataset (-.txt), pass through a converter to data mining format and a converter to text mining format respectively; the converted statistical data fill a database for data sets feeding the data mining tools, while the converted text fills an index database for text feeding the text mining tools.]


Consistent databases
As remarked above, data are organised in different ways, and we try to use the data in all forms; but if we want to compare the data and analyse their structures and dependencies, we need formats which fulfil conditions that make them comparable. It is not necessary to have the same format: consistency means for us that all data formats can be processed together in the set of algorithms we use for analysis. So consistency differs at different stages of data analysis. Let us give an example. In a questionnaire we may have scaled data and some answers with “Yes” and “No”; in most cases this is for us a consistent dataset. Only if we have a purely numerical tool do we need to transcribe “Yes” into “1” and “No” into “0”. The same would be necessary if we wanted to use different questionnaires in a combined analysis: then we have to homogenise the data, for example different scales in the same dimension (one questionnaire might use a scale of 1-5, the other 1-10; one “Yes” and “No”, the other “1” and “0”, and so on). These are simple examples; in practice there are much more complicated ones, where consistency has to be proven in a mathematical way. So we very often have to create temporary databases which fulfil the special conditions needed for a particular step of analysis.
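The homogenisation step in the example above can be sketched as follows. The target convention, floats on a common 0..1 scale, is our assumption; the text only requires that the different formats become comparable for the numerical tools.

```python
# Minimal sketch of homogenising questionnaire answers for a
# purely numerical tool. The common 0..1 target scale is our
# assumed convention, not one prescribed in the text.

def homogenise(value, scale_max=None):
    """Map a questionnaire answer onto a common 0..1 scale.
    Yes/No answers become 1.0/0.0; scaled answers are divided by
    the maximum of their original scale."""
    if isinstance(value, str):
        return {"yes": 1.0, "no": 0.0}[value.strip().lower()]
    return value / scale_max

# One questionnaire uses a 1-5 scale, another 1-10, a third Yes/No;
# after conversion all three answers live on the same scale.
print(homogenise(4, scale_max=5))
print(homogenise(8, scale_max=10))
print(homogenise("Yes"))
```

Such a converter would typically be run once per questionnaire to fill one of the temporary databases mentioned above.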

Pre-analysing
The first step in analysing systems should always be an intelligent presentation of the original data. Here the graphical presentation should have priority. In systems of high dimension, coloured multiple-picture graphics in three dimensions are a good way of getting an overview of such high complexity. A lot of ready-made software is available, both on the market and under free licences; in our software tools these options are normally included, and for special cases we use self-developed packages.
By looking at the original data in different ways and playing around with all kinds of possibilities, one gets a good first impression of the patterns and structures lying behind the dataset. Next one can examine the dataset with different unconditional automatic analysis tools. By unconditional we mean, at this step of the analysis, that no hypothesis, classification or rules of any kind are part of the analysis. As a result one can obtain structured data, dependencies between the different dimensions of the system, or classes which group the data.
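One such unconditional step, grouping instances into classes without any prior hypothesis, can be sketched with plain k-means clustering. The choice of k-means and the toy one-dimensional data are our assumptions; the text does not name the concrete clustering tool used.

```python
# Sketch of an "unconditional" analysis step: clustering answers
# into classes with k-means, without any predefined classification.
import random

def kmeans(points, k=2, iterations=20, seed=0):
    """Cluster 1-D points into k groups; return the sorted groups."""
    random.seed(seed)
    centres = random.sample(points, k)
    for _ in range(iterations):
        # assign every point to its nearest centre
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            groups[nearest].append(p)
        # move each centre to the mean of its group
        centres = [sum(g) / len(g) if g else centres[i]
                   for i, g in enumerate(groups)]
    return sorted(sorted(g) for g in groups)

answers = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]   # two obvious groups
print(kmeans(answers))
```

On well-separated data like this, the two recovered classes coincide with the visually obvious groups, which is exactly the kind of first impression the pre-analysis is meant to deliver.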

Rules and laws: the basis of modelling
The next step in interactive automatic data analysis is the use of more extensive software tools, programming platforms or libraries for data mining, information retrieval, machine learning etc. All of these provide a variety of algorithms, statistical analysis tools, graphical user interfaces (GUIs), input/output converters and many more useful modules, which can all be put together into a ready program or used in script form as a platform with toolkits for the different applications. Beside our own software we use packages such as “Weka”, “R”, “S”, “RapidMiner”, “ggobi”, “Lucene”, “Theme Selector”, “WordNet”, “Mathematica”, “Matlab” and several others which are not so important in this context. With these toolkits we have the opportunity to analyse the data sets under a variety of aspects. We do not want to talk about purely statistical analysis here; there are in any case more comfortable programs on the market than the ones mentioned above. We are interested in a more holistic view of the system.
The data sets we analyse represent, more or less, the structure and dependencies of the system and its parameters. This should be reflected in patterns in the data. The pre-analysis has probably already given us an overview of such patterns, demonstrating the underlying structure of the system. What it cannot show are rules and laws between the different parameters. For this we use specialised software tools which analyse the data for logical dependencies. The outcome is the most probable rule set under the algorithm tested. We test different algorithms and combinations of them and get different answers. Very often the algorithms do not converge, or the resulting rule sets have a poor probability. Only in the best case do one or a few rule sets have a high enough probability that one can accept the result as representative.
Getting more than one answer might seem strange, but remember that we are dealing with a high-dimensional problem. In such a case it is impossible to find an absolute minimum or maximum in reasonable time. On top of that, the available dataset is not sufficient, so the system is under-defined. One therefore has to live with the best relative result one can get and try to understand the underlying laws through multiple iterative analyses.
The approach described above leads us to the process of modelling. With knowledge of similar behaviour in natural or physical systems, we identify the similarities and differences. This leads us to an understanding of parts of the system or, in the best case, to a description of the whole system.
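One possible reading of such a rule-extraction step can be sketched with the classic OneR algorithm (implemented, for instance, in Weka, which is mentioned above; its use here is our choice, since the text does not name the concrete algorithms). OneR derives the single-attribute rule set with the fewest errors on classified instances; the toy weather-style data are invented.

```python
# Sketch of rule extraction with OneR: for each attribute, build a
# rule per value (its majority class) and keep the attribute whose
# rule set makes the fewest errors on the training instances.
from collections import Counter, defaultdict

def one_r(instances, attributes, target):
    """Return (best_attribute, rules, errors), where rules maps each
    value of the attribute to the class it predicts."""
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)
        for row in instances:
            by_value[row[attr]][row[target]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(row[target] != rules[row[attr]] for row in instances)
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

data = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
attr, rules, errors = one_r(data, ["outlook", "windy"], "play")
print(attr, rules, errors)
```

The returned rule set is exactly the kind of open, human-readable logical dependency described above, and comparing the error counts of competing rule sets mirrors the selection of the "most probable" one.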

Data Mining, Machine Learning and more

The basis of all of this are techniques of intelligent computation such as data mining, machine learning, information retrieval, text mining and others. Data mining is part of machine learning; one could describe it as the process of detecting patterns in data sets. The process is automatic or, more often, semi-automatic, and the patterns should give a meaningful description of the system. The processes consist of learning techniques which are partially based on Bayesian statistics. When we use the term learning, we use it in two senses:

1. representation of the knowledge
2. ability to use the knowledge

There are two types of learning algorithms in machine learning: the black-box type, like neural networks, which does not show the underlying decision structure but gives a structured output as a result; and the open-box type, which displays both the decision structure and the classified result. We will discuss here only the open-box type, because we need the decision structure for understanding the system and for model building.
The principles of data mining are shown in Figure 4. Starting with data sets organised in instances and attributes representing the system to be analysed, we obtain, as described above and further below, tables of organised data “belonging together”, clusters of data subsets, or rule sets, depending on the algorithms chosen. The aim of this type of data analysis is to get iteratively closer and closer to an understanding of the regarded system.

Figure 4 Data sets, organised in instances and attributes, are turned into tables, rules and clusters

Most of the learning techniques look for a structured description of the data set, which is used as the basis of learning. The outcomes of the analysis are rule sets. These descriptions can be very complicated and unintuitive for humans, so good presentation tools are necessary. Good visualisations are decision trees representing the resulting rule set; these are easy to understand and give a good overview of the dependencies of the system.
If one talks about machine learning and statistics, one should not make any fundamental difference; both simply have their origins in different historical traditions. The two scientific areas developed in parallel similar solutions for creating classification and decision trees out of learning data sets; likewise, in developing nearest-neighbour methods for classification, research generated similar solutions independently at the same time. The main difference between statistics and machine learning is the way the underlying mathematics is used. We understand our methods as “generalising as a way of search”. One way of understanding the problem of learning is to imagine it as a search in a concept space. Concepts, under this definition, are descriptions of the system in the form of rule sets derived from the example dataset used for learning. As mentioned above, learning in this sense means finding, in the resulting concept space, the best concept for the data sets representing the analysed system. As we discussed, this implicitly leads in practice to the problems of convergence and multiple solutions. We regard this not as a problem but rather as a chance to find the right models for understanding complexity. To remind you: there is no single solution describing a complex system; there is only the best approximation derived from the knowledge at the point of analysis. The consequence for data analysis and model building is to use all available information about the system, even if the information has different forms and is incomplete. That is why it is so important to use a huge toolbox giving access to a variety of tools. But it also means that you have to try the different tools to find the adequate one. If you want to drive a Phillips screw, you cannot use a normal screwdriver; you need a Phillips one. I think this describes our way of working best: we always have to find the right tool or tool set, because we cannot see exactly what kind of object we are working on. By trying out different methods and getting different views of the searched object, we get, time after time and step by step, a better feeling for this object, coming closer and closer to understanding it. Figure 5 shows this procedure, including the evolutionary part, which we take from nature. One should understand these symbolic steps as an iterative process resulting in a “Final?” model.


Figure 5 Iterative data mining procedure: data sets, system structure and system rules, together with models associated with evolutionary natural systems, lead to a “Final?” model

The theoretical world is ideal, but we have learned that even there we cannot answer the questions we have with every precision wanted. Much more difficult is the real world. So if one wants to apply the described techniques to practical cases, one has to adapt them to the special conditions of the regarded real system. In data mining applications to reality, one discriminates between the four different kinds of learning displayed in Figure 6:

Figure 6
• Classifying learning
• Associated learning
• Clustering
• Numeric prediction

In the case of classifying learning, the learning algorithm takes a set of classified data to learn the classifying rules and, on this basis, predicts the class of a new, unknown data set. In associated learning, all associations between the attributes are taken into account for classification, not only the associations which directly predict a particular class. Clustering builds groups of closely related data subsets, while numeric prediction does not result in a discrete class but gives a numerical value as a result.
Classifying learning is sometimes called supervised learning, because the defined classifications are fixed for the whole training data set. In contrast, associated learning has no defined classes. Association rules differ in two aspects from classifying rules: first, they can predict any single attribute, not only the class; and second, they can predict several attributes at the same time. Therefore there are many more association rules than classification ones.
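Classifying (supervised) learning can be sketched with a 1-nearest-neighbour classifier, one of the methods named in the machine learning topic list above. The attribute vectors and class labels are invented for illustration.

```python
# Sketch of classifying learning: predict the class of a new,
# unknown instance from its closest classified training instance.
import math

def nearest_neighbour(training, new_point):
    """training: list of (attribute_vector, class_label) pairs.
    Return the label of the training instance closest to new_point."""
    _, label = min(training, key=lambda item: math.dist(item[0], new_point))
    return label

training = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((4.8, 5.2), "B"),
]
print(nearest_neighbour(training, (1.1, 1.0)))  # near the "A" instances
print(nearest_neighbour(training, (5.1, 4.9)))  # near the "B" instances
```

Here the classification is fixed for the whole training set, which is exactly the supervised setting described above; clustering, by contrast, would have to discover the two groups without the labels.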

Text Mining: a Special Case

Statistical natural language processing (NLP) partially uses the same algorithms and methods as machine learning. In NLP one simply uses texts and regards the textual context as a surrogate for situating language in a real-world context. A body of text is called a corpus, and a collection of such bodies corpora. In the NLP approach the aim is to assign probabilities to linguistic events, so that we can say which sentences are “usual” and “unusual”. Statistical NLP practitioners are interested in good descriptions of the associations and preferences that occur in the totality of language use. A strong argument for probability as part of a scientific understanding of language is that human cognition is probabilistic, and that language, as an integral part of cognition, must therefore be probabilistic too. The argument for a probabilistic approach to cognition is that we live in a world with uncertainties and incomplete information; to be able to interact successfully with the world, we need to be able to deal with this kind of information.
NLP, like humans, has to deal with many problems which are due to this uncertainty of information in the real world. One of the main problems is the ambiguity of language. An example is shown in Figure 7, taken from reference 4.
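The idea of assigning probabilities to linguistic events can be sketched with a tiny bigram model. The two-sentence corpus is invented, and real statistical NLP systems would use large corpora and smoothing, which this sketch omits.

```python
# Sketch of a bigram language model with maximum likelihood
# estimates: "usual" word orders get non-zero probability,
# "unusual" ones fall to zero (no smoothing in this toy version).
from collections import Counter

corpus = ["the dog runs", "the cat runs"]
tokens = []
for sentence in corpus:
    tokens += ["<s>"] + sentence.split() + ["</s>"]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))  # the </s>,<s> pair is harmless here

def p(word, previous):
    """Maximum likelihood estimate of P(word | previous)."""
    return bigrams[(previous, word)] / unigrams[previous]

def sentence_probability(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= p(word, prev)
    return prob

print(sentence_probability("the dog runs"))   # a "usual" sentence
print(sentence_probability("the runs dog"))   # an "unusual" one
```

Even this toy model separates the usual from the unusual word order, which is the core of the probabilistic view of language sketched above.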


Figure 7 Ambiguity of language

A good NLP system must be powerful at making disambiguation decisions, besides solving various other uncertainties in language and grammar.
Statistical NLP methods have led the way in providing successful disambiguation in large-scale systems using naturally occurring text. The parameters of such NLP models can often be estimated automatically from reference text corpora. For generating such “knowledge” one needs lexical resources in machine-readable form. One of the old, well-known reference corpora is the Brown Corpus. It was generated in the 1960s and 1970s at Brown University and represents the American English of its time very well. Beside collecting the words of a corpus, one needs to assign to each word its kind, such as verb, noun, adjective etc.; this procedure is called tagging. A lot of other information is extracted from such a reference corpus, which in the end gives a reasonable statistical NLP model for text retrieval. Discussing this in full detail would exceed the scope of this paper.
Another problem in NLP is the enormous mass of data to be processed. Reference corpora consist of millions of words, and the corpora to be analysed are again in the same range. Handling these amounts of data in reasonable time with full-text analysis is not possible. The modern way to handle this is intelligent indexing. Out of each corpus one generates an index and puts it into a database. Very often one generates different index databases under different aspects, such as distances between words and other characteristics of content and grammar. Having different index databases is especially advantageous in modern parallel computing. Very effective software tools exist for generating such index databases, optimised in speed and efficiency for just this purpose. After generating all the necessary databases, the “full-text” search is mostly done in the adequate index databases, which is extremely fast and efficient. From the index databases the original text and position is referenced and can be automatically shown or processed in a further step. This procedure is advisable even if one searches a text corpus only once, because normally one does not look only for a word but for knowledge bound to words. Here I want to stop the explanation of text mining; the reality is more complex and needs a lot of experience.
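The indexing idea above can be sketched as a minimal inverted index: each word maps to the (document, position) pairs where it occurs, so a "full-text" search becomes a dictionary lookup that also references the original position. The two tiny documents are invented.

```python
# Sketch of an inverted index for fast text search, as described
# above. Production indexers (e.g. Lucene, mentioned earlier) add
# tokenisation, stemming, compression and ranking on top of this.
from collections import defaultdict

def build_index(documents):
    """documents: dict doc_id -> text. Return word -> [(doc_id, position)]."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

documents = {
    "d1": "Culture influences the diffusion of innovation",
    "d2": "The diffusion of ideas follows cultural patterns",
}
index = build_index(documents)
print(index["diffusion"])
```

The stored positions are what allow the original text passage to be referenced and shown automatically in a further processing step.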


Some Words about Modelling

The techniques described above for automatically analysing different kinds of data, such as text data, data from questionnaires, purely numeric data and so on, help us to extract knowledge for understanding the analysed topic or system. These automatic searches may also generate some logical models in the form of rule sets and laws, but they should never reduce our thinking about the system we are looking at; on the contrary, they should activate and enhance our imagination about the topics describing the searched object. Especially because we very often get multiple answers, they will stimulate our brains to new views of and insights into the system.
Apart from the fact that machine learning provides a lot of useful tools for looking at the original data sets and the resulting data structure, the automatic results obtained are not sufficient to build a comprehensive model. Because we have learning data analysis software, it is very easy to add new datasets to the analysis, which makes iterative processes advantageous for modelling. This way one can acquire specific datasets which verify the model approach or mark it as a wrong way. Because of the underlying Bayesian statistics, one can easily test the hypothesis network behind the different models and so verify the most probable one.
Until now we have mostly been speaking about logical, probabilistic and relational models. For describing a system, or at least part of it (a subsystem), and for understanding its dynamics, it is better to have a mathematical model determined by equations and numeric parameters. Especially if one wants to compare models with those of the natural sciences, one needs to make the description of the system more mathematical. We normally try to approach this problem by developing system descriptions from both sides.
First we apply machine learning and data mining techniques interactively to get an overview of the structure and behaviour of the system. Then we look for analogous systems in nature or the natural sciences. In the next step we analyse mathematically those datasets which can be formulated in a numerical way, using the proposed analogous model. Iteratively we improve the description of the system until we cannot get a better model based on the available data sets. The big advantages of using software tools like machine learning, data mining etc. for automatic data processing are the transparency of the model building and the ease of extending the modelling when new information or datasets become available. All processes are stored in a process scheme and can be used automatically in a new run of data analysis to improve the model.


Application to Social Sciences

Modelling in the social sciences is much more difficult than in the natural sciences. In the social sciences we mostly deal with empirical data raised from questionnaires. Even when quantified on scales, we would call such data quasi-numerical in comparison to the continuous numerical data normally acquired in the natural sciences. But precisely because of this inhomogeneous type of data, the modern data analysis strategies are, apart from the usual statistical analysis tools, well suited for use in the social sciences. Until now this field has been quite empty of reasonable examples, with one exception: demoscopic research and its applications. In this field, large software packages have been developed and trained by a large number of people in different countries and different subgroups for predicting political and user behaviour. Some platforms for text mining offering help for researchers are slowly developing, such as the National Centre for Text Mining in the UK (http://www.nactem.ac.uk), which encourages the use of text mining in the social sciences.
But what is the advisable use for a researcher in the social sciences? Is it not too much work and time to learn these techniques? Yes, we would answer: for normal research in the social sciences, the modern statistical tools do the right job, and the effort needed to understand all the algorithms and techniques is too much work and time for the additional information one can get. There are also no ready-to-use software packages directed at the social sciences. The practical way nowadays would be collaboration with a group of specialists who provide the knowledge and are interested in different applications.
There are three aspects which give advantages over the normal statistical analyses:
1. Additional information beyond the normal query analysis
2. Automatic text analysis
3. Generation of logical trees and rules.
Beside this, a lot of graphical tools for the pre-analysis of data sets, which is a usual procedure in semi-automatic data processing, are helpful in a first validation. For us, who want to model dependencies between different aspects of social dimensions, such as cultural dependencies, from non-homogeneous different databases, these techniques are a necessity in order to combine all available information. What is a must for us might help social scientists to understand their data in a different way and might give new views of the searched subjects.



References

Witten I.H. and Frank E., Data Mining, Carl Hanser Verlag, 2001

Manning C.D. and Schütze H., Foundations of Statistical Natural Language Processing, MIT Press, sixth printing 2003

Patton M.Q., Qualitative Research & Evaluation Methods, Sage, third edition 2002

Van de Vijver F. and Leung K., Methods and Data Analysis for Cross-Cultural Research, Sage Publications, 1997

Web page of the National Centre for Text Mining in the UK (http://www.nactem.ac.uk)

Web pages of Wikipedia (http://en.wikipedia.org/wiki/Machine_learning) under the keywords Machine learning and Text Mining