Smart Software with F#


4 Nov 2013


1



Smart Software with F#


Joel Pobar

Language Geek

http://callvirt.net/blog


2

Agenda

What is it?

F# Intro

Algorithms:

Search

Fuzzy Matching

Classification (SVM)

Recommendations

Q&A

3

All This in 45 mins?

This is an awareness session!

Lots of content, very broad, very fast

You’ll get all demos, pointers, and slide deck to take
offline and digest

Two takeaways:

F# is a great language for data

Smart algorithms aren’t hard: use them, explore more!


4

F# is

...a functional, object-oriented, imperative and explorative programming language for .NET


what is Functional Programming?



http://callvirt.net/jaoo.zip



5

What is Functional Programming?

Wikipedia: “A programming paradigm that
treats computation as the evaluation of
mathematical functions and avoids state and
mutable data”


-> Emphasizes functions

-> Emphasizes shapes of data, rather than implementation

-> Modeled on lambda calculus

-> Reduced emphasis on imperative

-> Safely raises level of abstraction

6

Motivation for Functional

Simplicity in life is good: cheaper, easier, faster,
better.

We typically achieve simplicity in software in two ways:

By raising the level of abstraction (and OO was one design
to raise abstraction)

Increasing modularity

Increasing signal to noise is another good strategy:

Communicate more in less time with more clarity

Better composition and modularity == reuse




7

Functional Programming

Safer, while still being useful

[Diagram: languages placed on two axes, Unsafe to Safe and Not Useful to Useful. Labels: C#, C++, …; V.Next; Haskell; F#]

8

What is F# for?



F# is a General Purpose language

Can be used for a broad range of programming tasks

Superset of imperative and dynamic features

Great for learning FP concepts

Some particularly important domains

Financial modeling and analysis

Data mining

Scientific data analysis

Domain-specific modeling

Academic



9

Let

‘Let’ binds values to identifiers


let helloWorld = "Hello, World"

// print_any is obsolete in current F#; printfn with %A prints any value
printfn "%A" helloWorld

let myNum = 12

let myAddFunction x y =
    let sum = x + y
    sum


Type inference.

The static typing of C# with the succinctness of a scripting language

10

Tuples

Simple, and most useful data structure


let site1 = ("msdn.com", 10)
let site2 = ("abc.net.au", 12)
let site3 = ("news.com.au", 22)
let allSites = (site1, site2, site3)

let fst (a, b) = a
let snd (a, b) = b

11

Lists, Arrays, Seq and Options

Lists & Arrays are first-class citizens

Options provide a some-or-nothing capability

let list1 = ["Joel"; "Luke"]
let array = [|2; 3; 5|]
let myseq = seq [0; 1; 2]

let option1 = Some("Joel")
let option2 = None

12

Records

Simple concrete type definition


type Person =
    { Name: string
      DateOfBirth: System.DateTime }

// DateOfBirth is a DateTime, so 13/04/81 is built as a DateTime rather than a string
let n = { Name = "Joel"; DateOfBirth = System.DateTime(1981, 4, 13) }

13

Immutability (by default)



Values may not be changed

Data is immutable by default
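
A minimal sketch of what immutable by default looks like in practice. The names `total`, `counter`, `numbers` and `doubled` are illustrative only and do not come from the deck.

```fsharp
let total = 10
// total <- 11   // would not compile: 'total' is immutable

// Opting in to mutation requires the 'mutable' keyword.
let mutable counter = 0
counter <- counter + 1

// "Changing" immutable data means building a new value from the old one.
let numbers = [1; 2; 3]
let doubled = List.map (fun n -> n * 2) numbers

printfn "%d %A %A" counter numbers doubled
```

The original list is untouched after the map, which is what makes values safe to share between functions.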

14

Discriminated Unions

Great for representing the structure of data

type Make = string
type Model = string
type Transport =
    | Car of Make * Model
    | Bicycle

let me = Car ("Holden", "Barina")
let you = Bicycle

Both of these identifiers are of type “Transport”

15

Functions

Functions: like delegates + unified and simple

Deep type inference


(fun x -> x + 1)

let myFunc x = x + 1

val myFunc : int -> int

let rec factorial n =
    if n > 1 then n * factorial (n - 1)
    else 1

let data = [5; 3; 4; 4; 5]

// Sorting with a custom comparer is List.sortWith in current F#
List.sortWith (fun x y -> x - y) data

16

Pattern Matching


open System

let (fst, _) = ("first", "second")
Console.WriteLine(fst)

// Type-test patterns (:?) dispatch on the runtime type of an obj
let switchOnType (a: obj) =
    match a with
    | :? Int32 -> printfn "int!"
    | :? Transport -> printfn "Transport"
    | _ -> printfn "Everything Else!"

Very important part of F#

Helps deal with the ‘teasing apart’ of data

Works best with Discriminated Unions & Records
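
To illustrate the last point, here is a small sketch that pattern matches over the Transport union from the Discriminated Unions slide. The function name `describe` is hypothetical and not part of the deck.

```fsharp
type Make = string
type Model = string
type Transport =
    | Car of Make * Model
    | Bicycle

// Each case is taken apart by shape; the compiler warns if a case is missed.
let describe transport =
    match transport with
    | Car (make, model) -> sprintf "Car: %s %s" make model
    | Bicycle -> "Bicycle"

printfn "%s" (describe (Car ("Holden", "Barina")))
printfn "%s" (describe Bicycle)
```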


17

Lists, Types, Interactive

18

Search

Given a search term and a large document
corpus, rank and return a list of the most
relevant results…

19

Blog Crawler

20

Search

Words

Stemming? Tokenize?

E.g. ‘Python/Ruby’

Markup

Title, Author, Date

Headings (h1, h2 etc.)

Paragraphs

Links

A sign of strength?

Let’s explore something simple…

21

Search

Simplify:

For easy machine/language manipulation

… and most importantly, easy computation

Vectors: nature’s own quality data structure

Convenient machine representation (lists/arrays)

Lots of existing vector math algorithms

After a loving incubation period, moonlight 2.0 has been released. <a href="whatever">source code</a><br><a href="something else">FireFox binaries</a> …

2 after

1 incubation

1 loving

6 moonlight

4 firefox

6 linux

2 binaries
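
One way to get from marked-up text to term counts is sketched below. It is an assumption about how the demo might work, not the demo's actual code: `termCounts` is a hypothetical helper, tags are stripped with a simple regular expression, and no stemming is attempted.

```fsharp
open System.Text.RegularExpressions

// Hypothetical helper: strip tags, lower-case, split on non-letters, count terms.
let termCounts (html: string) =
    let text = Regex.Replace(html, "<[^>]+>", " ")
    Regex.Split(text.ToLowerInvariant(), "[^a-z]+")
    |> Array.filter (fun word -> word <> "")
    |> Seq.countBy id
    |> Map.ofSeq

let sample =
    "After a loving incubation period, moonlight 2.0 has been released. <a href=\"whatever\">source code</a>"

let counts = termCounts sample
printfn "%A" counts
```

The resulting map is the sparse form of the document vector: every term that is absent from the map has a count of zero.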

22

Term Count

Document1: Linux post:

9 the, 1 incubation, 1 crazy, 6 moonlight, 4 firefox, 6 linux, 2 penguin

Document2: Animal post:

2 the, 1 dog, 5 penguin, 2 crazy

Vector space:

term    the  incubation  crazy  moonlight  firefox  linux  dog  penguin
Linux     9           1      1          6        4      6    0        2
Animal    2           0      2          0        0      0    1        5

23

Term Count Issues

‘the dog penguin’

Linux: 9+0+2 = 11

Animal: 2+1+5 = 8

‘the’ is overweight

Enter TF-IDF: Term Frequency Inverse Document Frequency

A weight to evaluate how important a word is to a document in a corpus

i.e. if ‘the’ occurs in 98% of all documents, we shouldn’t weight it very highly in the total query

term    the  incubation  crazy  moonlight  firefox  linux  dog  penguin
Linux     9           1      1          6        4      6    0        2
Animal    2           0      2          0        0      0    1        5
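
The raw term-count scoring for ‘the dog penguin’ can be reproduced in a few lines. This is a sketch using the two vectors from the slide; the function name `score` is hypothetical.

```fsharp
let terms  = [| "the"; "incubation"; "crazy"; "moonlight"; "firefox"; "linux"; "dog"; "penguin" |]
let linux  = [| 9; 1; 1; 6; 4; 6; 0; 2 |]
let animal = [| 2; 0; 2; 0; 0; 0; 1; 5 |]

// Score = sum of the document's counts for each query term.
let score (doc: int[]) (query: string list) =
    query
    |> List.sumBy (fun term ->
        match Array.tryFindIndex (fun t -> t = term) terms with
        | Some i -> doc.[i]
        | None -> 0)

let query = ["the"; "dog"; "penguin"]
printfn "Linux: %d, Animal: %d" (score linux query) (score animal query)
// prints "Linux: 11, Animal: 8"
```

The Linux post wins only because of its nine occurrences of ‘the’, which is the weighting problem this slide describes.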

24

TF-IDF

Normalise the term count:

tf = termCount / docWordCount

Measure importance of term

idf = log ( |D| / termDocumentCount )

where |D| is the total number of documents in the corpus

tfidf = tf * idf

A high weight is reached by high term frequency,
and a low document frequency
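
The three formulas translate almost directly into F#. The sketch below assumes a toy corpus made of just the two posts from the Term Count slide; the helper names follow the formula names on this slide, and returning 0.0 for a term that appears in no document is an added assumption to avoid dividing by zero.

```fsharp
// Each document is a list of (term, count) pairs, as on the Term Count slide.
let linuxPost  = [ ("the", 9); ("incubation", 1); ("crazy", 1); ("moonlight", 6); ("firefox", 4); ("linux", 6); ("penguin", 2) ]
let animalPost = [ ("the", 2); ("dog", 1); ("penguin", 5); ("crazy", 2) ]
let corpus = [ linuxPost; animalPost ]

let termCount term doc =
    match List.tryFind (fun (t, _) -> t = term) doc with
    | Some (_, count) -> count
    | None -> 0

// tf = termCount / docWordCount
let tf term doc =
    float (termCount term doc) / float (List.sumBy snd doc)

// idf = log (|D| / termDocumentCount); 0.0 when no document contains the term
let idf term =
    let termDocumentCount =
        corpus |> List.filter (fun doc -> termCount term doc > 0) |> List.length
    if termDocumentCount = 0 then 0.0
    else log (float (List.length corpus) / float termDocumentCount)

let tfidf term doc = tf term doc * idf term

printfn "the: %f, dog: %f" (tfidf "the" linuxPost) (tfidf "dog" animalPost)
```

Because ‘the’ occurs in every document, its idf is log 1 = 0 and it no longer contributes to the score, while ‘dog’ keeps a positive weight.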

25

Search Engine in under 10 mins

26

Fuzzy Matching

String similarity algorithms:

SoundEx; Metaphone

Jaro-Winkler Distance; Cosine similarity; Sellers; Euclidean distance; …

We’ll look at the Levenshtein Distance algorithm

Defined as: the minimum number of edit operations that transform string1 into string2



27

Fuzzy Matching

Edit costs:

In-place copy - cost 0

Delete a character in string1 - cost 1

Insert a character in string2 - cost 1

Substitute a character for another - cost 1

Transform ‘kitten’ into ‘sitting’

kitten -> sitten (cost 1 - replace k with s)

sitten -> sittin (cost 1 - replace e with i)

sittin -> sitting (cost 1 - add g)

Levenshtein distance: 3
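
The edit costs above map onto the textbook dynamic-programming table. The sketch below is one straightforward way to write it, not necessarily the version used in the demo; the function name `levenshtein` is an assumption.

```fsharp
let levenshtein (s: string) (t: string) =
    // d.[i, j] = cost of turning the first i chars of s into the first j chars of t
    let d = Array2D.create (s.Length + 1) (t.Length + 1) 0
    for i in 0 .. s.Length do d.[i, 0] <- i   // delete every character
    for j in 0 .. t.Length do d.[0, j] <- j   // insert every character
    for i in 1 .. s.Length do
        for j in 1 .. t.Length do
            let delete = d.[i - 1, j] + 1
            let insert = d.[i, j - 1] + 1
            // in-place copy costs 0, substitution costs 1
            let substitute = d.[i - 1, j - 1] + (if s.[i - 1] = t.[j - 1] then 0 else 1)
            d.[i, j] <- min delete (min insert substitute)
    d.[s.Length, t.Length]

printfn "%d" (levenshtein "kitten" "sitting")
// prints 3
```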



28

Fuzzy Matching

Estimated string similarity computation costs:

Hard on the GC (lots of temporary strings created and thrown away; use arrays if possible).

Levenshtein can be computed in O(kl) time, where ‘l’ is the length of the shorter string, and ‘k’ is the maximum distance.

Parallelisable - split the set of words to compare across n cores.

Can do approximately 10,000 compares per second on a standard single core laptop.
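
A sketch of the two practical points above: working on int arrays instead of temporary strings, and spreading the compares across cores. It uses the plain two-row algorithm, not the bounded O(kl) variant, and the word list and the names `distance` and `closest` are made up for illustration.

```fsharp
// Two-row Levenshtein: only two int arrays are allocated, no temporary strings.
let distance (s: string) (t: string) =
    let mutable prev = Array.init (t.Length + 1) id
    let mutable curr = Array.create (t.Length + 1) 0
    for i in 1 .. s.Length do
        curr.[0] <- i
        for j in 1 .. t.Length do
            let cost = if s.[i - 1] = t.[j - 1] then 0 else 1
            curr.[j] <- min (min (prev.[j] + 1) (curr.[j - 1] + 1)) (prev.[j - 1] + cost)
        // swap the rows so 'prev' always holds the last completed row
        let tmp = prev
        prev <- curr
        curr <- tmp
    prev.[t.Length]

// Hypothetical word list; Array.Parallel.map spreads the compares across cores.
let dictionary = [| "kitten"; "sitting"; "mitten"; "fitting"; "written" |]

let closest (word: string) =
    dictionary
    |> Array.Parallel.map (fun candidate -> candidate, distance word candidate)
    |> Array.minBy snd

printfn "%A" (closest "sittin")
```

Each compare is independent of the others, which is why the work splits cleanly across n cores.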



29

Did You Mean?

30

Classification

Support Vector Machines (SVM)

Supervised learning for binary classification

Training Inputs: ‘in’ and ‘out’ vectors.

SVM will then find a separating ‘hyperplane’ in an n-dimensional space

Training costs, but classification is cheap

Can retrain on the fly in some cases
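
Why classification is cheap: once training has produced a hyperplane, classifying a point is one dot product and a sign check. The sketch below covers only that step for a linear SVM; the weights `w` and bias `b` are made-up numbers standing in for the output of training, which is not shown.

```fsharp
// Hypothetical, already-trained hyperplane: these values are invented for illustration.
let w = [| 2.0; -1.0; 0.5 |]
let b = -1.0

let dot (x: float[]) (y: float[]) =
    Array.fold2 (fun acc xi yi -> acc + xi * yi) 0.0 x y

// 'in' (+1) if the point lies on the positive side of the hyperplane, 'out' (-1) otherwise.
let classify (x: float[]) =
    if dot w x + b >= 0.0 then 1 else -1

printfn "%d %d" (classify [| 1.0; 0.0; 2.0 |]) (classify [| 0.0; 3.0; 0.0 |])
// prints "1 -1"
```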


31

SVM Classification

32

SVM Issues

Classification on 2 dimensions is easy, but most input is multi-dimensional

Some ‘tricks’ are needed to transform the input
data

33

SVM Classifier

34

F# and Algorithms

Netflix Demo


Netflix Prize - $1 million USD

Must beat Netflix prediction algorithm by 10%

480k users

100 million ratings

18,000 movies

Great example of deriving value out of large
datasets

Earns Netflix loads and loads of $$$!

35

MovieId  CustomerId  Rating
Clerks       444444       5
Clerks      2093393       4
Clerks          999       5
Clerks      8668478       1
Dogma       2432114       3
Dogma        444444       5
Dogma           999       5
...             ...     ...

Nearest Neighbour

Find neighbours who like what I like

36

MovieId  CustomerId  Rating
Clerks       444444       5
Clerks      2093393       4
Clerks          999       5
Clerks      8668478       1
Dogma       2432114       3
Dogma        444444       5
Dogma           999       5
...             ...     ...

Netflix Data Format

Netflix Demo

37

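[Table: CustomerId rows against movie columns 302, 4418, 3, 56, 732; each row lists that customer’s ratings in order, blank cells not preserved]

444444: 5, 4, 5, 2

999: 5, 5, 1

111211: 3, 5, 3

66666: 5, 5

1212121: 5, 4

5656565: 1

454545: 5, 5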

Nearest Neighbour Algorithm

Find all my neighbours’ movies

Find the best movies my neighbours agree on
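
The two steps can be sketched over the rows from the Netflix data-format slide. This is an illustrative take, not the demo's code: neighbours are assumed to be customers whose ratings differ from mine by at most `maxDifference` on average over the movies we share, and the names `ratingsBy`, `difference` and `recommend` are hypothetical.

```fsharp
// (MovieId, CustomerId, Rating) rows from the data-format slide.
let ratings =
    [ ("Clerks", 444444, 5)
      ("Clerks", 2093393, 4)
      ("Clerks", 999, 5)
      ("Clerks", 8668478, 1)
      ("Dogma", 2432114, 3)
      ("Dogma", 444444, 5)
      ("Dogma", 999, 5) ]

let ratingsBy customer =
    ratings
    |> List.filter (fun (_, c, _) -> c = customer)
    |> List.map (fun (m, _, r) -> m, r)
    |> Map.ofList

// Mean absolute rating difference over shared movies; None if nothing is shared.
let difference me other =
    let mine, theirs = ratingsBy me, ratingsBy other
    let shared =
        mine
        |> Map.toList
        |> List.choose (fun (m, r) -> Map.tryFind m theirs |> Option.map (fun r2 -> abs (r - r2)))
    if List.isEmpty shared then None else Some (List.averageBy float shared)

let recommend me maxDifference =
    let mine = ratingsBy me
    // Step 1: find my neighbours.
    let neighbours =
        ratings
        |> List.map (fun (_, c, _) -> c)
        |> List.distinct
        |> List.filter (fun c -> c <> me)
        |> List.filter (fun c ->
            match difference me c with
            | Some d -> d <= maxDifference
            | None -> false)
    // Step 2: rank the movies they rated that I have not seen, by average rating.
    ratings
    |> List.filter (fun (m, c, _) -> List.contains c neighbours && not (Map.containsKey m mine))
    |> List.groupBy (fun (m, _, _) -> m)
    |> List.map (fun (m, rows) -> m, List.averageBy (fun (_, _, r) -> float r) rows)
    |> List.sortByDescending snd

printfn "%A" (recommend 2093393 1.0)
```

For customer 2093393, who gave Clerks a 4, the neighbours are 444444 and 999 (both gave it a 5), and both rated Dogma a 5, so Dogma is the recommendation.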

38

Netflix Recommendations

39

A Short Stop-over at Vector Math

A (x1,y1)

B (x2,y2)

C (x0,y0)

If we want to calculate the distance between A and B, we call on Euclidean Distance


We can represent the points in the same way using Vectors: Magnitude and Direction.


Having this Vector representation allows us to work in ‘n’ dimensions, yet still achieve Euclidean Distance/Angle calculations.
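
A small sketch of both calculations over arrays of any length. The two sample vectors reuse the Linux and Animal term counts from the search slides; the function names are illustrative, and cosine similarity is used here as the angle measure.

```fsharp
let linux  = [| 9.0; 1.0; 1.0; 6.0; 4.0; 6.0; 0.0; 2.0 |]
let animal = [| 2.0; 0.0; 2.0; 0.0; 0.0; 0.0; 1.0; 5.0 |]

let dot (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + x * y) 0.0 a b

let magnitude a = sqrt (dot a a)

// Euclidean distance: the magnitude of the difference vector.
let euclidean (a: float[]) (b: float[]) =
    magnitude (Array.map2 (fun x y -> x - y) a b)

// Cosine of the angle between the vectors: 1.0 = same direction, 0.0 = orthogonal.
let cosine a b = dot a b / (magnitude a * magnitude b)

printfn "distance: %f, cosine: %f" (euclidean linux animal) (cosine linux animal)
```

Nothing in these functions depends on the number of dimensions, so the same code serves 2-D points and eight-term document vectors alike.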

40

Q & A

Any questions?

http://callvirt.net/

joelpobar@gmail.com

THANKS!