
Nov 7, 2013


Using Blackboard Systems for
Polyphonic Transcription

A Literature Review

by Cory McKay


Intro to polyphonic transcription

Intro to blackboard systems

Keith Martin’s work

Kunio Kashino’s work

Recent contributions


Polyphonic Transcription

Represent an audio signal as a score

Must segregate notes belonging to different voices

Problems: variations of timbre within a
voice, voice crossing, identification of
correct octave

No successful general-purpose system to date

Polyphonic Transcription

Can use simplified models:

Music for a single instrument (e.g. piano)

Extract only a given instrument from mix

Use music which obeys restrictive rules

Simplified systems have had success rates
of between 80% and 90%

These rates may be exaggerated, since only
very limited testing suites generally used

Polyphonic Transcription

Systems to date generally identify only
rhythm, pitch and voice

Would like systems that also identify other
notated aspects such as dynamics and

Ideal is to have system that can identify and
understand parameters of music that
humans hear but do not notate

Blackboard Systems

Used in AI for decades but only applied to music
transcription in early 1990s

Term “blackboard” comes from notion of a group
of experts standing around a blackboard working
together to solve a problem

Each expert writes contributions on blackboard

Experts watch problem evolve on blackboard,
making changes until a solution is reached

Blackboard Systems

“Blackboard” is a central dataspace

Usually arranged in hierarchy so that input is at
lowest level and output is at highest

“Experts” are called “knowledge sources”

KSs generally consist of a set of heuristics and a
precondition whose satisfaction results in a
hypothesis that is written on blackboard

Each KS forms hypotheses based on information
from front end of system and hypotheses
presented by other KSs

Blackboard Systems

Problem is solved when all KSs are satisfied
with all hypotheses on blackboard to within
a given margin of error

Eliminates need for global control module

Each KS can be easily updated and new
KSs can be added with little difficulty

Combines top-down and bottom-up processing
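
The blackboard idea described above can be sketched in a few lines. This is a minimal illustration only, not any published system: the class names, the level names ("partials", "notes", "intervals") and the two toy knowledge sources are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Blackboard:
    # Central dataspace: hypotheses grouped by (hypothetical) level name.
    levels: Dict[str, List[str]] = field(default_factory=dict)

    def post(self, level: str, hypothesis: str) -> bool:
        """Write a hypothesis; return True only if it is new."""
        entries = self.levels.setdefault(level, [])
        if hypothesis in entries:
            return False
        entries.append(hypothesis)
        return True


@dataclass
class KnowledgeSource:
    name: str
    precondition: Callable[[Blackboard], bool]
    action: Callable[[Blackboard], bool]  # returns True if the blackboard changed


def run(blackboard: Blackboard, sources: List[KnowledgeSource], max_rounds: int = 10) -> int:
    """Let every KS act in turn until a full round changes nothing."""
    for round_number in range(1, max_rounds + 1):
        changed = False
        for ks in sources:
            if ks.precondition(blackboard):
                changed = ks.action(blackboard) or changed
        if not changed:
            return round_number
    return max_rounds


def note_from_partial(bb: Blackboard) -> bool:
    # Toy lookup table standing in for real signal-level heuristics.
    table = {"262 Hz": "C4", "392 Hz": "G4"}
    changed = False
    for partial in bb.levels.get("partials", []):
        if partial in table:
            changed = bb.post("notes", table[partial]) or changed
    return changed


def fifth_from_notes(bb: Blackboard) -> bool:
    notes = bb.levels.get("notes", [])
    if "C4" in notes and "G4" in notes:
        return bb.post("intervals", "C4-G4 perfect fifth")
    return False


board = Blackboard({"partials": ["262 Hz", "392 Hz"]})
sources = [
    KnowledgeSource("interval", lambda bb: "notes" in bb.levels, fifth_from_notes),
    KnowledgeSource("note", lambda bb: "partials" in bb.levels, note_from_partial),
]
rounds = run(board, sources)
```

The loop in run only takes turns; all problem knowledge lives in the knowledge sources, so a new KS can be appended to the list without touching the others.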

Blackboard Systems

Music has a naturally hierarchical structure
that lends itself well to blackboard systems

Allow integration of different types of KSs:

signal processing KSs at low level

human perception KSs at middle level

musical knowledge KSs at upper level

Blackboard Systems

Limitation: giving upper level KSs too much
specialized knowledge and influence limits
generality of transcription systems

Ideal system would not use knowledge above the
level of human perception and the most
rudimentary understanding of music

Current trend is to increase significance of upper
level musical KSs in order to increase success rate

Keith Martin (1996 a)

“A Blackboard System for Automatic
Transcription of Simple Polyphonic Music”

Used a blackboard system to transcribe a four
voice Bach chorale with appropriate segregation
of voices

Limited input signal to synthesized piano

Gave system only rudimentary musical
knowledge, although choice of Bach chorale
allowed the use of generally unacceptable
assumptions by lower level KSs

Keith Martin (1996 a)

Front-end system used short-time Fourier
transform on input signal

Equivalent to a filter bank that is a gross
approximation of the way the human cochlea
processes auditory signals

Blackboard system fed sets of associated
onset times, frequencies and amplitudes
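
A short-time Fourier transform front end of the kind described above can be sketched as follows. This is an illustrative sketch, not Martin's implementation: the frame size, hop size, Hann window and test tone are all assumptions, and a direct DFT is used for clarity rather than speed.

```python
import cmath
import math


def stft(signal, frame_size, hop):
    """Return a list of magnitude spectra, one per Hann-windowed frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size) for n in range(frame_size)]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_size)]
        spectrum = []
        for k in range(frame_size // 2 + 1):
            total = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size))
            spectrum.append(abs(total))
        frames.append(spectrum)
    return frames


def strongest_frequency(spectrum, sample_rate, frame_size):
    """Frequency in Hz of the largest-magnitude bin."""
    peak_bin = max(range(len(spectrum)), key=lambda k: spectrum[k])
    return peak_bin * sample_rate / frame_size


# Hypothetical test input: a 500 Hz sine sampled at 8 kHz.
SAMPLE_RATE, FRAME_SIZE, HOP = 8000, 256, 128
tone = [math.sin(2 * math.pi * 500 * n / SAMPLE_RATE) for n in range(512)]
frames = stft(tone, FRAME_SIZE, HOP)
peaks = [strongest_frequency(f, SAMPLE_RATE, FRAME_SIZE) for f in frames]
```

Each frame yields a (time, frequency, amplitude) observation of the sort a blackboard could receive as input.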

Keith Martin (1996 a)

Knowledge sources made five classes of
hierarchically organized hypotheses:






Keith Martin (1996 a)

Three types of knowledge sources:

Garbage collection


Musical practice

Thirteen knowledge sources in all

Each KS only authorized to make certain
classes of hypotheses

Keith Martin (1996 a)

KSs with access to upper-level hypotheses can put
“pressure” on KSs with lower-level access to
make certain hypotheses, and vice versa

Example: if the hypotheses have been made that
the notes C and G are present in a beat, a KS with
information about chords might put forward the
hypothesis that there is a C chord, thus putting
pressure on other KSs to find an E or Eb.

Used a sequential scheduler to coordinate KSs
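
The C and G example can be sketched as two knowledge sources run one after the other. This is a hypothetical illustration: the function names, the strength values and the two thresholds are assumptions, and "pressure" is modelled simply as a lower acceptance threshold for expected notes.

```python
def chord_ks(notes):
    """If C and G are hypothesised, propose a C chord and the thirds that would complete it."""
    if {"C", "G"} <= notes:
        return {"chord": "C (major or minor)", "expected": {"E", "Eb"}}
    return None


def note_ks(candidates, expected, threshold=0.5, relaxed=0.3):
    """Accept weak candidates more readily when a higher-level KS expects them."""
    accepted = set()
    for pitch, strength in candidates.items():
        limit = relaxed if pitch in expected else threshold
        if strength >= limit:
            accepted.add(pitch)
    return accepted


# Hypothetical evidence strengths from the front end for one beat.
candidates = {"C": 0.9, "G": 0.8, "E": 0.4, "F#": 0.4}

# Sequential scheduling: note KS, then chord KS, then note KS again.
first_pass = note_ks(candidates, expected=set())
chord = chord_ks(first_pass)
second_pass = note_ks(candidates, expected=chord["expected"] if chord else set())
```

The weak E is accepted on the second pass because the chord hypothesis expects it, while the equally weak F# is still rejected.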

Keith Martin (1996 b)

“Automatic Transcription of Simple
Polyphonic Music: Robust Front End
Processing”

Previous system often misidentified octaves

Attempted to improve performance by
shifting octave identification task from a
top-down process to a bottom-up process

Keith Martin (1996 b)

Proposes the use of log-lag correlograms in front end

Models the inner hair cells in the cochlea with a
bank of filters

Determines pitch by measuring the periodic
energy in each filter channel as a function of lag

Correlograms now basic unit fed to blackboard

No definitive results as to which approach is better
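
Measuring periodic energy as a function of lag can be sketched with a plain autocorrelation. This is a deliberately reduced illustration: it uses a single channel and linearly spaced lags, so it omits the cochlear filter bank and the log-lag axis of a real correlogram. The test tone and lag range are assumptions.

```python
import math


def autocorrelation(signal, max_lag):
    """Periodic energy at each lag from 0 to max_lag."""
    n = len(signal) - max_lag
    return [sum(signal[i] * signal[i + lag] for i in range(n))
            for lag in range(max_lag + 1)]


def pitch_from_lags(signal, sample_rate, min_lag, max_lag):
    """Pitch in Hz taken from the lag with the most periodic energy."""
    acf = autocorrelation(signal, max_lag)
    best_lag = max(range(min_lag, max_lag + 1), key=lambda lag: acf[lag])
    return sample_rate / best_lag


# Hypothetical input: 200 Hz fundamental plus a weaker 400 Hz harmonic at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
        + 0.5 * math.sin(2 * math.pi * 400 * n / SAMPLE_RATE)
        for n in range(380)]
pitch = pitch_from_lags(tone, SAMPLE_RATE, min_lag=20, max_lag=60)
```

Because the harmonic repeats at the fundamental's period as well, the strongest lag corresponds to 200 Hz rather than 400 Hz, which is the kind of bottom-up octave evidence the slide refers to.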

Kashino, Nakadai, Kinoshita and
Tanaka (1995)

“Application of Bayesian Probability Networks to
Music Scene Analysis”

Work slightly preceded that of Martin

Used test patterns involving more than one instrument

Uses principles of stream segregation from
auditory scene analysis

Implements more high-level musical knowledge

Uses Bayesian network instead of Martin’s simple
scheduler to coordinate KSs

Kashino, Nakadai, Kinoshita and
Tanaka (1995)

Knowledge sources used:

Chord transition dictionary

Chord-note relation

Chord naming rules

Tone memory

Timbre models

Human perception rules

Used very specific instrument timbres and musical
rules, so has limited general applicability

Kashino, Nakadai, Kinoshita and
Tanaka (1995)

Tone memory: frequency components of
different instruments played with different

Found that the integration of tone memory
with the other KSs greatly improved
success rates

Kashino, Nakadai, Kinoshita and
Tanaka (1995)

Bayesian networks well known for finding good
solutions despite noisy input or missing data

Often used in implementing learning methods that
trade off prior belief in a hypothesis against its
agreement with current data

Therefore seem to be a good choice for
coordinating KSs
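
The trade-off between prior belief and agreement with current data can be shown with plain Bayes' rule. This is a single-node illustration, not a full Bayesian network, and the hypotheses and probabilities are made-up numbers chosen for the example.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a set of mutually exclusive hypotheses."""
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalised.values())
    return {h: value / total for h, value in unnormalised.items()}


# Hypothetical octave decision: musical context favours C4,
# but the observed spectrum fits C5 better.
priors = {"C4": 0.7, "C5": 0.3}
likelihoods = {"C4": 0.2, "C5": 0.6}
belief = posterior(priors, likelihoods)
winner = max(belief, key=belief.get)
```

Here the data outweighs the prior, so C5 ends up with the higher belief (0.5625 against 0.4375).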

Kashino, Nakadai, Kinoshita and
Tanaka (1995)

No experimental comparisons of this
approach and Martin’s simple scheduler

Only used simple test patterns rather than
real music

Kashino and Hagita (1996)

“A Music Scene Analysis System with the MRF-Based
Information Integration Scheme”

Suggests replacing Bayesian networks with
Markov Random Field hypothesis network

Successful in correcting two most common
problems in previous system:

Misidentification of instruments

Incorrect octave labelling

Kashino and Hagita (1996)

MRF-based networks use simulated annealing to
converge to a low-energy state

MRF approach enables information to be
integrated on a multiply connected hypothesis network

Bayesian networks only allow singly connected networks

Could now deal with two kinds of transition
information within a single hypothesis network:

chord transitions

note transitions
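
Simulated annealing on a small multiply connected network can be sketched as below. This is a toy illustration, not the system from the paper: the three-node loop, the binary labels, the cost tables and the cooling schedule are all assumptions.

```python
import math
import random

# Hypothetical triangle of note hypotheses; each takes octave label 0 or 1.
UNARY = [
    [0.2, 1.0],   # node 0 prefers label 0
    [0.9, 0.8],   # node 1 is nearly undecided
    [0.3, 1.1],   # node 2 prefers label 0
]
EDGES = [(0, 1), (1, 2), (0, 2)]  # a loop, so the graph is multiply connected
DISAGREE_COST = 0.5


def energy(labels):
    """Sum of local costs plus a penalty for each pair of neighbours that disagree."""
    cost = sum(UNARY[i][label] for i, label in enumerate(labels))
    cost += sum(DISAGREE_COST for a, b in EDGES if labels[a] != labels[b])
    return cost


def anneal(start, steps=500, temperature=2.0, cooling=0.99, seed=0):
    """Return the lowest-energy labelling seen while the temperature falls."""
    rng = random.Random(seed)
    current, best = list(start), list(start)
    for _ in range(steps):
        candidate = list(current)
        node = rng.randrange(len(candidate))
        candidate[node] = 1 - candidate[node]
        delta = energy(candidate) - energy(current)
        # Always accept improvements; accept worse states with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
        if energy(current) < energy(best):
            best = list(current)
        temperature *= cooling
    return best


start = [1, 1, 1]
best = anneal(start)
```

The pairwise penalty lets the two confident neighbours pull the undecided node toward their label, which is the kind of mutual constraint a loop in the network makes possible.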

Kashino and Hagita (1996)

Instrument and octave identification errors
corrected, but some new errors introduced

Overall, performed roughly 10% better than
Bayesian-based system at transcribing 3-part arrangement of
Auld Lang Syne

Still only had a recognition rate of 71.7%

Kashino and Murase (1998)

Shifts some work away from blackboard system
by feeding it higher-level information

Simplifies and mathematically formalizes notion
of knowledge sources

Switches back to Bayesian network

Perhaps not truly a blackboard system anymore

Has very good recognition rate

Scalability of system is seriously compromised by
new approach

Kashino and Murase (1998)

Uses adaptive template matching

Implemented using a bank of filters
arranged in parallel and a number of
templates corresponding to particular notes
played by particular instruments

The correlation between the outputs of the
filters is calculated and a match is then
made to one of the templates
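
Matching filter outputs against stored templates can be sketched with a correlation score. This is an illustrative sketch only: the template values and the four-channel filter bank are invented for the example, and the adaptive part of adaptive template matching (updating the templates) is not modelled.

```python
import math


def correlation(a, b):
    """Pearson correlation between two equal-length vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    norm = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / norm if norm else 0.0


def best_template(filter_outputs, templates):
    """Name of the template that correlates most strongly with the filter outputs."""
    return max(templates, key=lambda name: correlation(filter_outputs, templates[name]))


# Hypothetical templates: relative energy in four filter channels.
templates = {
    "piano C4": [1.0, 0.5, 0.25, 0.1],
    "flute C4": [1.0, 0.1, 0.3, 0.05],
    "violin C4": [0.6, 0.8, 0.7, 0.5],
}
observed = [0.9, 0.45, 0.3, 0.1]
match = best_template(observed, templates)
```

Every added instrument and note adds another template to compare against, which illustrates why the approach may scale poorly.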

Kashino and Murase (1998)

Achieved recognition rate of 88.5% on real
recordings of piano, violin and flute

Including templates for many more instruments
could make adaptive template matching intractable

Particularly a problem for instruments with

Similar frequency spectra

A great deal of spectral variation from note to note

Hainsworth and Macleod (2001)

“Automatic Bass Line Transcription from
Polyphonic Music”

Wanted to be able to extract a single given
instrument from an arbitrary musical signal

Contrast to previous approaches of using
recordings of only one instrument or a set of
defined instruments

Hainsworth and Macleod (2001)

Chose to work with bass

Can filter out high frequencies

Notes usually fairly steady

Used simple mathematical relations to trim
hypotheses rather than a true blackboard

Had a 78.7% success rate on a Miles Davis

Bello and Sandler (2000)

“Blackboard Systems and Top-Down Processing
for the Transcription of Simple Polyphonic Music”

Return to a true blackboard system

Based on Martin’s implementation, using a
conventional scheduler

Refines knowledge sources and adds high-level
musical knowledge

Implements one of the knowledge sources as a neural network

Bello and Sandler (2000)

The chord recognizer KS is a feedforward network

Trained using spectrograms of different
piano chords

Trained network is fed a spectrogram and outputs
possible chords

Can therefore output more than one hypothesis at
each iteration

Gives other KSs more information and allows
parallel exploration of solution space
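
A feedforward chord recognizer that returns several hypotheses at once can be sketched as follows. This is a toy illustration, not the network from the paper: the four-input layout, the hand-picked weights (standing in for trained ones) and the 0.5 threshold are all assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer with sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


# Hand-picked weights stand in for a trained network (an assumption for illustration).
# Inputs: spectral energy at the pitch classes C, Eb, E, G.
HIDDEN_W = [[3.0, 0.0, 3.0, 3.0],   # responds to C + E + G
            [3.0, 3.0, 0.0, 3.0]]   # responds to C + Eb + G
HIDDEN_B = [-7.0, -7.0]
OUTPUT_W = [[6.0, 0.0], [0.0, 6.0]]
OUTPUT_B = [-3.0, -3.0]
CHORDS = ["C major", "C minor"]


def chord_hypotheses(energies, threshold=0.5):
    """Return every chord whose output score exceeds the threshold."""
    hidden = layer(energies, HIDDEN_W, HIDDEN_B)
    scores = layer(hidden, OUTPUT_W, OUTPUT_B)
    return [(name, score) for name, score in zip(CHORDS, scores) if score > threshold]


clear = chord_hypotheses([1.0, 0.0, 1.0, 1.0])      # strong C, E, G
ambiguous = chord_hypotheses([1.0, 0.5, 0.5, 1.0])  # third is unclear
```

With a clear major third only one chord is returned; with an unclear third both chords pass the threshold, so other KSs receive two hypotheses to explore.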

Bello and Sandler (2000)

Could automatically retrain network to recognize
spectrograms of other instruments with no manual
modifications needed

Preliminary testing showed tendency to
misidentify octaves and make incorrect
identification of note onsets

These problems could potentially be corrected by
signal processing system that feeds blackboard


Bass transcription system and more recent
work of Kashino useful for specific
applications, but limited potential for
general transcription purposes

True blackboard approach scales well and
appears to hold the most potential for
general-purpose polyphonic transcription


Use of adaptive learning in knowledge
sources seems promising

Interchangeable modules could be
automatically trained to specialize in
different areas

Could have semi-automatic transcription,
where user chooses correct modules and
system performs transcription using them