Adam Kilgarriff,* Vit Suchomel*^


18 Nov 2013



*Lexical Computing Ltd, UK

^NLP Laboratory, Masaryk Univ, Brno, Cz

1


The particular Moroccan oil could very well moisturize dry skin handing it out an even makeup including easier different textures.

Now on the web stores are very aggressive price smart so there genuinely isn’t any very good cause to go way out of your way to get the presents (unless of course of program you procrastinated).

Hemorrhoids sickliness is incorrect to be considered as a lethiferous malaise even though shut-ins are struck with calamitous tantrums of agonizing hazards, bulging soreness and irritating psoriasis.

2


We don’t want it


OUP


We can’t take examples from your corpus without
checking every one because they might be web spam


More (or, cleverer) than it used to be


enTenTen12 vs enTenTen08


Hard to filter (by design)


Moving target

3

Web spamming:

“actions intended to mislead search engines into ranking some pages higher than they deserve”

Gyöngyi and Garcia-Molina, 2005

4

Moroccan Oil is alcohol-free and has a patented weightless formula with no build up. Softens thick unmanageable hair and restores shine and softness to dull lifeless hair. Instantly absorbed into the hair. Moroccan Oil will help eliminate frizz, speeds up styling time by 40%, and provides long-term conditioning to all hair types.

Are $20 shampoos and conditioners worth it? Can good hair-care products be found at the drugstore, or are the expensive salon products really superior?

5


6


Taxonomy of techniques


Gyöngyi and Garcia-Molina, 2005


AIRWEB (Adversarial IR on the web)


Five workshops 2005-09


Held at WWW conferences


Since 2009: attention shifted to social-network spam

7


Best dataset: UK2007


Second challenge


Collecting labelling effort


6479 hosts manually classified


6% of data is spam


Uniform random sampling of web crawl


By domain, not by page


Six participants in challenge


All used supervised learning


Top score: 85% area under curve


8


No search engines, no web spam


Lots of data, expertise, algorithms


All private

AIRWEB included representatives

BootCaT

Piggybacks on search engine anti-spam methods

9


Level of web page, not host, domain


Contra AIRWEB


Lacks coherence


Text with injections or target terms


Anomalous words

10


Hypothesis


Biggest difference is web spam


Keywords method


Frequencies in each corpus


Normalise to per million


Add smoothing parameter 0.001 to each


Divide higher number by lower


Two keyword lists


Dictionary filter


hunspell


Study top 100 of each
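The steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy counts and the corpus sizes are hypothetical (the slides do not give corpus sizes), and a plain set of words stands in for the hunspell dictionary check.

```python
def per_million(freq, corpus_size):
    """Normalise a raw count to occurrences per million tokens."""
    return freq * 1_000_000 / corpus_size


def keyword_lists(counts_a, size_a, counts_b, size_b,
                  smoothing=0.001, dictionary=None, top_n=100):
    """Return two ranked keyword lists, one per corpus.

    Each list holds (word, score) pairs, where score is the higher
    smoothed per-million frequency divided by the lower one.
    """
    keywords_a, keywords_b = [], []
    for word in set(counts_a) | set(counts_b):
        # Dictionary filter: a plain set stands in for a hunspell lookup.
        if dictionary is not None and word not in dictionary:
            continue
        fpm_a = per_million(counts_a.get(word, 0), size_a) + smoothing
        fpm_b = per_million(counts_b.get(word, 0), size_b) + smoothing
        if fpm_a > fpm_b:
            keywords_a.append((word, fpm_a / fpm_b))
        elif fpm_b > fpm_a:
            keywords_b.append((word, fpm_b / fpm_a))
        # Equal frequencies: the word is a keyword of neither corpus.

    def by_score(pair):
        # Highest score first; ties broken alphabetically.
        return (-pair[1], pair[0])

    return (sorted(keywords_a, key=by_score)[:top_n],
            sorted(keywords_b, key=by_score)[:top_n])


# Toy data (hypothetical counts, one million tokens per corpus).
new_counts = {"tweeted": 50, "the": 60000, "xqzt": 900}
old_counts = {"videodisc": 4, "the": 60000}
known_words = {"tweeted", "the", "videodisc"}

new_kw, old_kw = keyword_lists(new_counts, 1_000_000,
                               old_counts, 1_000_000,
                               dictionary=known_words)
print(new_kw)
print(old_kw)
```

The smoothing parameter keeps the division defined when a word is absent from one corpus. Because it is small (0.001), words that are missing or very rare in the other corpus get very high scores.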

11

                 enTenTen12           enTenTen08
lc               Freq     Freq/mill   Freq     Freq/mill   Score    Rank
tweeted          28711    2.2         11       0.0         507.41   1
jewelries        18012    1.4         35       0.0         118.72   2
tweeting         26024    2.0         67       0.0         93.40    3
colorway         6395     0.5         17       0.0         79.69    4
hemorrhoid       57951    4.5         181      0.1         79.29    5
straighteners    28206    2.2         133      0.0         52.20    6
courageousness   8717     0.7         40       0.0         50.86    7
twitter          712447   54.9        3602     1.1         49.81    8
straightener     23324    1.8         137      0.0         41.94    9
colorways        4242     0.3         23       0.0         40.83    10
anticlimaxes     2584     0.2         14       0.0         37.91    11

12

NEW THINGS


tweeting tweeted twitter


(photo) voltaic (cells)


atomizer (as part of apparatus for giving up smoking)


jailbreak (verb: remove limitations on an Apple device)


NEW WORDS


colorway colorways aftereffect (increasingly spelt as one word)

SHOPPING

footwear espadrille sneaker slingback huarache

handbags holdalls

chronograph chronographs timepiece timepieces watchstrap watchmaking

birthstone birthstones

foodstuff

headpins (jewelry making)

pantyliner

jerseys

SERVICES

locksmith locksmiths refacing (for kitchen cabinets)

MONEY

refinance refinancing remortgages defrayal cosigner loaners


WEDDINGS


bridesmaid boutonnieres honeymoons groomsmen


13

HEALTH AND BEAUTY


periodontist whitening veneers aligners (both mainly for teeth)

hemorrhoid hemorrhoids

hairstyles straightener straighteners

slimming physique cellulite liposuction stretchmarks suntanning

moisturize moisturizes moisturized dehydrators detoxing

pimples whiteheads blackhead blackheads

breakouts (of acne etc) concealer concealers (of acne etc)

tinnitus

RARE DICTIONARY WORDS

accouter osculate

MORPHOLOGY

humorousness severeness sturdiness impecuniousness comfortableness

anxiousness adorableness courageousness neglectfulness moldiness safeness

anticlimaxes chitchats attires apparels jewelries jackpots

wagerer vacationer dandier

acquirable conveyable

dejecting unexceptionally

NAMES (incorrectly included - most were filtered out)

spellbinders (a name) circuital (album) android (operating system)


OTHER


frontward proficiently


14

                   enTenTen08           enTenTen12
lc                 Freq     Freq/mill   Freq     Freq/mill   Score    Rank
twelfths           2070     0.6         98       0.0         74.1     1
holograph          3542     1.1         627      0.0         22.0     2
fuehrer            6581     2.0         1220     0.1         21.2     3
declassification   5332     1.6         989      0.1         21.1     4
subtenant          3283     1.0         619      0.0         20.6     5
indemnifying       3654     1.1         772      0.1         18.5     6
libeler            160      0.0         29       0.0         15.4     7
videocassette      3898     1.2         1033     0.1         14.8     8
maunders           188      0.1         39       0.0         14.6     9
palatinates        86       0.0         13       0.0         13.6     10
wardresses         65       0.0         7        0.0         13.6     11
reexports          290      0.1         73       0.0         13.5     12
videodisc          1278     0.4         363      0.0         13.5     13

15


16

                 enTenTen12           enTenTen08
lc               Freq     Freq/mill   Freq     Freq/mill   Score    Rank
tweeted          28711    2.2         11       0.0         507.41   1
jewelries        18012    1.4         35       0.0         118.72   2
tweeting         26024    2.0         67       0.0         93.40    3
colorway         6395     0.5         17       0.0         79.69    4
hemorrhoid       57951    4.5         181      0.1         79.29    5
straighteners    28206    2.2         133      0.0         52.20    6
courageousness   8717     0.7         40       0.0         50.86    7
twitter          712447   54.9        3602     1.1         49.81    8
straightener     23324    1.8         137      0.0         41.94    9
colorways        4242     0.3         23       0.0         40.83    10
anticlimaxes     2584     0.2         14       0.0         37.91    11

New things have arrived on the web, more dramatically than things have left it

17


All except twitter words, voltaic, jailbreak

Almost all of sample of concordances

spam

Marketing

In between

Lethiferous words

Osculate (kiss)

porn spam

Accouter (dress)

clothing spam

18


11 -ness nouns

6 pre-empted

comfortableness / comfort

6 plurals

3 emphatically mass nouns

Attires apparels jewelries

2 -er nouns

Wagerer (exclusively in pure spam) and vacationer

2 -able adjectives


19


Computer is generating these forms


Non-native speakers

Dialect: Kachru’s outer circle

India Pakistan Malaysia Kenya Nigeria


20

This, in addendum to modern sedate safeness concerns, numberless increases in data sum total, and rising cost pressures, closest these organizations with some uncommonly outstanding topic challenges.

21

A lot of individuals choose for numerous floor bunny rabbit cages with brings joining the levels. This grants the bunny rabbit a lot extra room without borrowing more room inside your haven. Owning a line flooring inside your bunny rabbit Cage isn’t a good plan if you would like to give comfortableness for your bunny rabbit. While having a wire bed with a pull out and makes for simpler maintaining, it’s not all of the time necessary as bunnies are easily litter box trained.

22

It is dream of every woman to have a perfect wardrobe. The thing that tops the list to make the wardrobe a complete one is a black shoe. Ladies black shoes add style and versatility to the attires. From casuals to formal black is the colour that makes the feet stand out from the crowd.

23


The web isn’t what it used to be

Search engines: friend or foe


Filtering it out


AIRWEB challenges


Hard, and a moving target


if we publish, spammers will smile


The cline


From marketing to gibberish


From language to noise


Content farms


Full of sound and fury / signifying nothing

24