Integrated Applications of Machine Learning and Audio Recognition in HMI


2013/12/14


J.-S. Roger Jang (張智星)

Multimedia Information Retrieval (MIR) Lab

CS Dept, Tsing Hua Univ., Taiwan

http://mirlab.org/jang


Outline


Introduction to HMI

Speech recognition

Siri

Music retrieval

Shazam

Soundhound (Midomi)

Future trends


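Introduction to HMI

HMI: Human-machine interface

HCI, CHI

Evolution of HMI

Past: graphical user interfaces, mouse

Present: multi-touch, motion sensing, voice assistants

Future: haptic touch, 3D virtual reality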


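Introduction to HMI (II)

Main reasons for the rapid growth of HMI

Improved peripherals: touchscreens, three-axis accelerometers

Leap in computing speed: Moore’s law

Advances in machine learning theory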


IEEE Feature Story: The Motion Tech Behind Kinect


Then and Now


Retrospect: Apple’s visionary video


Knowledge Navigator (1987)


Now


Android pad/phone


iPad/iPhone


FaceTime


Siri


Siri


An offshoot of the DARPA-funded project, CALO, based at SRI.


Functionality


Schedule meetings (calendar)


Place phone calls (contacts).


Read and write messages (text and email).


Interact with the Maps app and location services.


Forward search phrases to certain pre-defined data providers (Yahoo! Weather, Yahoo! Finance, Yelp, Wolfram|Alpha, or Wikipedia).


Video demo


Underlying Technologies of Siri


Speech recognition (Nuance, with 56 languages)

Natural language understanding (CALO)

Machine learning and data mining for collecting, modeling and fusing multiple sources of information


Categorization of ASR


Tasks


Voice commands


Keyword spotting


Spoken language understanding

Dictation

Environment

Quiet office

Noisy

Platforms

Embedded devices

PC

Server-based cloud computing

Methods

Template-based (speaker-dependent)

Statistics-based (speaker-independent)
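
Template-based recognition is often illustrated with dynamic time warping (DTW), which compares an utterance against stored recordings of each command from the same speaker. The following is a minimal sketch under stated assumptions: each frame is a single number standing in for a real feature vector such as MFCCs, and the names `dtw_distance`, `recognize`, and the toy templates are hypothetical, not taken from the slides.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognize(utterance, templates):
    """Return the command whose template is closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Toy templates (assumed values, one number per frame).
templates = {"yes": [1, 3, 4, 3, 1], "no": [5, 5, 2, 1, 1]}
slow_yes = [1, 1, 3, 4, 4, 3, 1]  # the same contour spoken more slowly
print(recognize(slow_yes, templates))
```

Because DTW absorbs differences in speaking rate but not in voice characteristics, this style of matching works best when the templates come from the same speaker, which is why the slide labels it speaker-dependent.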


ASR: A General Flowchart


Source: Jeffrey A. Bilmes, “Natural Statistical Models for Automatic Speech Recognition”

[Figure: ASR flowchart; surviving labels: Grammar, Language models]


Grammars in ASR: Voice Commands


Original form


After optimization


Grammars in ASR: Keyword Spotting


Possible utterances


Jack


Call Jack


Dial Jack


Roger, please


Call Roger


Connect me with Roger





Grammar (Lexicon net)
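
A keyword-spotting grammar of this kind accepts arbitrary filler words around a keyword. The sketch below illustrates the idea on already-transcribed text rather than on acoustic features; the function name `spot_keyword` and the keyword set are assumptions for illustration, and a real lexicon net would be decoded jointly with the acoustic model.

```python
import re

# Keywords (contact names) taken from the example utterances above.
KEYWORDS = ("jack", "roger")


def spot_keyword(utterance):
    """Return the first keyword in the utterance, or None.

    Every non-keyword word is treated as filler, which mimics a lexicon
    net of the form: filler* keyword filler*.
    """
    words = re.findall(r"[a-z']+", utterance.lower())
    for word in words:
        if word in KEYWORDS:
            return word
    return None


for text in ["Jack", "Call Jack", "Roger, please", "Connect me with Roger"]:
    print(text, "->", spot_keyword(text))
```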




Grammars in ASR: SLU


Possible utterances


Please reserve a table for two

I’d like to make a reservation for 4 persons at 6pm

Reserve a table for 4 for me





Grammar (Lexicon net)
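
Spoken language understanding goes beyond spotting a keyword: it fills slots such as party size and time. As a rough sketch on transcribed text, the regular expressions below stand in for the grammar; the slot names, the patterns, and `parse_reservation` are hypothetical and cover only the example utterances above.

```python
import re

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}

# Assumed patterns for the two slots that appear in the examples.
PARTY_RE = re.compile(
    r"\bfor (\d+|" + "|".join(NUMBER_WORDS) + r")\b(?! ?(?:am|pm))"
)
TIME_RE = re.compile(r"\bat (\d{1,2}(?::\d{2})? ?(?:am|pm))\b")


def parse_reservation(utterance):
    """Map an utterance to a reservation intent with optional slots."""
    text = utterance.lower()
    if "reserv" not in text:  # matches "reserve" and "reservation"
        return None
    slots = {"intent": "reserve_table", "party_size": None, "time": None}
    party = PARTY_RE.search(text)
    if party:
        token = party.group(1)
        slots["party_size"] = int(token) if token.isdigit() else NUMBER_WORDS[token]
    time = TIME_RE.search(text)
    if time:
        slots["time"] = time.group(1).replace(" ", "")
    return slots


print(parse_reservation("I'd like to make a reservation for 4 persons at 6pm"))
```

Slots left as `None` are the ones a dialog manager would need to ask about in a follow-up turn.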


Siri as a Virtual Personal Assistant


Awareness


Locations


Time


Personal info


Contact info


Tasks


External APIs for help


Speech


Speech to text


Speaker adaptation


Natural language processing


Task & domain models


Text to intent


Dialog flow



Example Dialogs


Contacts


Tell my wife I’m running late

Remind me to call the vet

Web services

How do I get to Times Square? (Map)

How many calories are in an egg? (Wiki)

Context awareness

U: Any good burger joints around here?

S: I found a number of burger restaurants near you.

U: Hmm. How about tacos?


Apps Required for Dinner Planning



Siri for Dinner Planning



Multiple-criteria vertical search

Combining multiple information sources

With integrated transactions

Task-oriented communication


Siri: Master of Mash-ups

Source: Tom Gruber, “Siri, a Virtual Personal Assistant”


Lots of Domain & Task Models



Lots of Web Services

Contacts

Email

Maps

And many many more…


When Siri Succeeds & Fails…


Fails: Scottish accents


Scottish guy Vs Siri.


iPhone has problems with Scottish accents

Siri can't understand Scottish folk.

Succeeds: Indian accents

My Conversation with Siri (Indian Accent)

iPhone 4S's - Siri - Working with an Indian Accent


Speech Interface for Edutainment


Edutainment prototypes developed by MIR Lab

Idiom relay (成語接龍)

Recitation machine (唸唸不忘)

Bricks of idioms (一語中的)


Where to download?


http://mirlab.org/mir_products.asp


Music Identification: Shazam


Facts


First commercial product of audio fingerprinting


Since 2002, UK


Technology


Audio fingerprinting


Founder


Avery Wang (PhD at Stanford, 1994)


Music Identification: Soundhound

Facts

First product with multi-modal music search

AKA: midomi


Technologies


Audio fingerprinting


Query by singing/humming


Speech recognition


Founder


Keyvan Mohajer (PhD at Stanford, 2007)


Audio Fingerprinting


Goal


Identify a noisy version of a given music clip

Technical barriers

Robustness

Efficiency (6M tags/day for Shazam)

Database collection (15M tracks for Shazam)

Applications

Song purchase

Royalty assignment (over radio)

Confirmation of commercials (over TV)

Copyright violation (over web)


TV program ID


Landmark-Based Audio Fingerprinting

Offline: Database construction


Landmark detection


Hash table creation


Online: Application


Landmark detection


Hash table search


Ranking results


(Source: Dan Ellis)
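
The offline and online stages above can be sketched as follows. This is a toy illustration, not Shazam's or Dan Ellis's implementation: it assumes spectral peaks have already been extracted as (time frame, frequency bin) pairs, and the names `landmarks`, `build_table`, `identify`, and the parameters `fan_out` and `max_dt` are hypothetical.

```python
from collections import Counter, defaultdict


def landmarks(peaks, fan_out=3, max_dt=20):
    """Pair each peak with a few later peaks: ((f1, f2, dt), anchor_time)."""
    peaks = sorted(peaks)
    result = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break
            if dt > 0:
                result.append(((f1, f2, dt), t1))
                paired += 1
                if paired == fan_out:
                    break
    return result


def build_table(tracks):
    """Offline: hash every landmark of every track into one table."""
    table = defaultdict(list)
    for track_id, peaks in tracks.items():
        for key, t1 in landmarks(peaks):
            table[key].append((track_id, t1))
    return table


def identify(table, query_peaks):
    """Online: look up query landmarks and rank tracks by consistent offsets."""
    votes = Counter()
    for key, tq in landmarks(query_peaks):
        for track_id, tdb in table.get(key, ()):
            votes[(track_id, tdb - tq)] += 1
    best = Counter()
    for (track_id, _offset), count in votes.items():
        best[track_id] = max(best[track_id], count)
    return best.most_common()


# Made-up peak lists standing in for two database tracks.
tracks = {
    "song_a": [(0, 10), (3, 25), (5, 40), (9, 12), (12, 33), (15, 20)],
    "song_b": [(0, 11), (2, 30), (6, 18), (8, 44), (11, 27), (14, 35)],
}
table = build_table(tracks)
# Excerpt of song_a starting at frame 3, with one spurious noise peak (4, 50).
query = [(0, 25), (2, 40), (4, 50), (6, 12), (9, 33)]
print(identify(table, query))
```

Voting on the time offset between database and query landmarks is what makes the match robust: noise peaks produce scattered hashes, while true matches pile up at a single offset.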


Demos of Audio Fingerprinting


Shazam


Soundhound


Query by Singing/Humming


Goal


Identify a song from a singing or humming clip

Technical barriers


Robustness


Efficiency


Database collection


MIDI


MP3



Applications


Karaoke


Singing transcription


Singing scoring


Technical details


Query by singing/humming
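
One common way to compare a hummed query with database melodies is to remove the key difference by subtracting the mean pitch and to absorb the tempo difference by stretching the query to several lengths. The sketch below illustrates that idea only; it is not presented as the method behind the demos, and the function names, scale factors, and toy melodies (MIDI note numbers, one per frame) are assumptions.

```python
def resample(pitch, length):
    """Linearly stretch or compress a pitch sequence to the given length."""
    if length == 1:
        return [pitch[0]]
    step = (len(pitch) - 1) / (length - 1)
    out = []
    for i in range(length):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(pitch) - 1)
        frac = pos - lo
        out.append(pitch[lo] * (1 - frac) + pitch[hi] * frac)
    return out


def shift_to_zero_mean(pitch):
    """Remove the key: singers rarely start on the original note."""
    mean = sum(pitch) / len(pitch)
    return [p - mean for p in pitch]


def distance(query, reference, scales=(0.5, 0.75, 1.0, 1.25, 1.5, 2.0)):
    """Smallest mean absolute pitch difference over the tried tempo scales."""
    best = float("inf")
    for scale in scales:
        length = round(len(query) * scale)
        if length < 2 or length > len(reference):
            continue
        q = shift_to_zero_mean(resample(query, length))
        r = shift_to_zero_mean(reference[:length])
        cost = sum(abs(a - b) for a, b in zip(q, r)) / length
        best = min(best, cost)
    return best


def rank(query, database):
    """Return song names ordered from best to worst match."""
    return sorted(database, key=lambda name: distance(query, database[name]))


# Toy database of two melodies (assumed values).
database = {
    "twinkle": [60, 60, 67, 67, 69, 69, 67, 67, 65, 65, 64, 64, 62, 62, 60, 60],
    "ode": [64, 64, 65, 67, 67, 65, 64, 62, 60, 60, 62, 64, 64, 62, 62, 62],
}
# "twinkle" hummed five semitones lower and twice as fast.
query = [55, 62, 64, 62, 60, 59, 57, 55]
print(rank(query, database))
```

This version always aligns the query with the start of each melody; matching from arbitrary positions, or using dynamic time warping instead of uniform stretching, trades speed for flexibility.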



Demos of QBSH


Demos on PC


Real-time pitch tracking

Pitch scaling

Miracle

13,000 songs in database

Single server with GPU

Demos on embedded systems


Toys


Other Demos of Music Applications


Beat tracking


Music genre classification


Audio melody extraction


Score following


Drum identification


Vocal suppression


Vocal extraction



In the Future…



VPA Evolution

[Figure: VPA evolution diagram, (c) 2009 Siri, Inc.; only the labels survive]

Stages: Today; Tomorrow; Future

Goals: Getting Personal; Doing Things For You; Getting What You Say

Other labels: Speech; Location; Date/time; Conversational UI; Service APIs; Faceted Search Domains; Calendars; Contacts; Profile; Favorites; Data feeds; Recommendation Services; SW & Data Commons; Auth Standards; Linguistic NLP; Social Contexts; Semantic NLP; Explicit Preferences; Learned Preferences


Dialogs of Future VPA


Fact finding


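What is the "Five Dynasties and Ten Kingdoms" period?

Who are Jay Chou's rumored girlfriends, and when did the rumors arise?

IBM’s Watson

Comparison

Which carrier's mobile plan is better than my current one?

(Female user) Who suits me better, Jack or Hank?

Location-based

Please find me a Thai restaurant near an MRT station, about ten minutes by MRT from my office.

Tasks

If I go to bed before ten, wake me up at six tomorrow.

If it rains tomorrow morning, cancel the alarm automatically.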


Music Apps in the Future


Music apps


Singing scoring


Sensibol.com


www.thinkit.com.cn


Drumming


Conducting




Technologies


Audio melody extraction


Singing pronunciation scoring

Detection of

Drums

Vibrato

Body movement

Facial expression



Thank you for listening.

Questions and comments welcome!


Speech & Music Libraries by MIR Lab


MIR Lab and the demo page


Music


Query by singing/humming


Audio fingerprinting


Speech


Voice command


Speech scoring