MarkLogic Server: Under The Hood

thumpinsplish | Internet and Web Applications

18 Nov 2013

Copyright © 2008 Mark Logic Corporation. All rights reserved.

Unlock Content™

MarkLogic Server: Under The Hood


Mary Holstege

Principal Engineer

MarkLogic Server

XML Server


Special-purpose DBMS for XML

Semi-structured

Hierarchical

Designed for 100s of TB of XML





How Did We Get Here?

Founder: Christopher Lindblad

MIT

Architect of Ultraseek Server

Intranet search engine product

Met people who wanted to use a search engine like a database

Rich query language

Guaranteed correctness

Transactions

Consider an Application

Documents + metadata

Documents: rich, variable structure


Want: complex full-text search

Want: combined text, metadata, structure-aware search

Want: granular ad hoc access

Want: real-time query


How do you build it?

Two-headed Monster

I’m an RDBMS

Answers are right or wrong

I like to combine small pieces

I allow granular access

Linguistic complexity hurts my brain

I guarantee ACID properties

Updates are visible right away

I’m a search engine

Some answers are better than others

Most pieces of information are large

I can give you the whole document

Structure hurts my brain

I’m optimized for sparse data

Updates are visible… oh, whenever

A Different Approach

Soul of a Search Engine: Data Model And Queries

Database: On-disk Organization And Transactions


Data Model


[Figure: a document made up of Title, Author, Abstract, Sections, Footer, and Metadata]

Data Model

A database for XML . . .

. . . uses the XML Data Model

XML is a tree


[Figure: the document as a tree: Document with Title, Author (First, Last), Sections, and Metadata]

Example Document


<article>
  <title>MarkLogic Server: The Best Place for XML</title>
  <author><first-name>John</first-name><last-name>Kreisa</last-name></author>
  <abstract>
    Where should one put their XML? <company>Mark Logic</company> has the
    best answer to this question: MarkLogic Server. . . .
  </abstract>
  <body>
    <section>
      <section>This high performance engine can . . .</section>
    </section>
    <section>Using an inverted index technique . . .</section>
  </body>
  <copyright>Copyright© 2008 Mark Logic Corporation. All rights Reserved.</copyright>
</article>

What Queries Is It Good At?

1) Full-Text Search

Find all documents that contain the phrase “high performance”.

2) XML Structure

Find all articles that have an abstract.

3) XML Semantics

Find all documents that mention the company “Mark Logic”.

4) All of the above . . . at the same time

Find all articles that contain the phrase “high performance” and
mention the company “Mark Logic” in the abstract.

1) Full-text Search

Find all documents that contain the phrase “high performance”

<article>
  <title>MarkLogic Server: The Best Place for XML</title>
  <author><first-name>John</first-name><last-name>Kreisa</last-name></author>
  <abstract>
    Where should one put their XML? <company>Mark Logic</company> has the
    best answer to this question: MarkLogic Server. . . .
  </abstract>
  <body>
    <section>
      <section>This high performance engine can . . .</section>
    </section>
    <section>Using an inverted index technique . . .</section>
  </body>
  <copyright>Copyright© 2008 Mark Logic Corporation. All rights Reserved.</copyright>
</article>



1) Full-text Search

Doc    very   high   performance   index
122    0      1      0             0
123    1      0      1             1
124    0      0      0             0
125    0      1      0             0
126    0      1      1             0
127    1      0      0             0
129    1      1      0             0
130    0      1      1             1

Find all documents that contain the phrase “high performance”


1) Full-text Search

UNIVERSAL INDEX

Term                    Document References
“very”                  123, 127, 129, 152, 344, 791 . . .
“high”                  122, 125, 126, 129, 130, 167 . . .
“performance”           123, 126, 130, 142, 143, 167 . . .
“index”                 123, 130, 131, 135, 162, 177 . . .
“high performance”      126, 130, 167, 212, 219, 377 . . .
“very high”             . . .
“performance index”     . . .

Matching documents: 126, 130, 167, 212, 219, 377 . . .

Find all documents that contain the phrase “high performance”
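
As a rough illustration of the term lists above, here is a minimal Python sketch of an inverted index that stores both single words and two-word phrases as terms. The function names, the tokenizer, and the sample texts are all invented for this example; the slides do not show how the server itself builds or stores its index.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def build_index(docs):
    """Map each word and each two-word phrase to a sorted list of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = tokenize(text)
        for word in words:
            index[word].add(doc_id)
        # Adjacent word pairs become phrase terms, e.g. "high performance".
        for first, second in zip(words, words[1:]):
            index[first + " " + second].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents: the ids echo the slide, the text is made up.
docs = {
    123: "a very fast performance index",
    126: "this high performance engine",
    129: "very high throughput",
    130: "a high performance index",
}
index = build_index(docs)
print(index["high performance"])  # [126, 130]
print(index["very"])              # [123, 129]
```

With phrases stored as terms, the phrase query becomes a single lookup rather than a scan of every document that contains "high".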


2) XML Structure

Find all articles that have an abstract



<article>
  <title>MarkLogic Server: The Best Place for XML</title>
  <author><first-name>John</first-name><last-name>Kreisa</last-name></author>
  <abstract>
    Where should one put their XML? <company>Mark Logic</company> has the
    best answer to this question: MarkLogic Server. . . .
  </abstract>
  <body>
    <section>
      <section>This high performance engine can . . .</section>
    </section>
    <section>Using an inverted index technique . . .</section>
  </body>
  <copyright>Copyright© 2008 Mark Logic Corporation. All rights Reserved.</copyright>
</article>



2) XML Structure


UNIVERSAL INDEX

Term                    Document References
“very”                  123, 127, 129, 152, 344, 791 . . .
“high”                  122, 125, 126, 129, 130, 167 . . .
“performance”           123, 126, 130, 142, 143, 167 . . .
“index”                 123, 130, 131, 135, 162, 177 . . .
“high performance”      126, 130, 167, 212, 219, 377 . . .
<article>               . . .
<article>/<abstract>    . . .

Matching documents: 126, 130, 167, 212, 219, 377 . . .

Find all articles that have an abstract
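
To make the structural terms concrete, the following Python sketch derives element terms and parent/child terms from small XML documents and indexes them the same way as words. It assumes well-formed input and uses the standard library's ElementTree; the function names and the three sample documents are hypothetical, not taken from the product.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def structure_terms(xml_text):
    """Return element terms and parent/child terms for one document."""
    root = ET.fromstring(xml_text)
    terms = set()
    stack = [root]
    while stack:
        node = stack.pop()
        terms.add("<%s>" % node.tag)
        for child in node:
            # A parent/child pair is just another term in the index.
            terms.add("<%s>/<%s>" % (node.tag, child.tag))
            stack.append(child)
    return terms


def index_structure(docs):
    """Map each structural term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        for term in structure_terms(xml_text):
            index[term].add(doc_id)
    return index


# Hypothetical documents for illustration only.
docs = {
    1: "<article><title>T</title><abstract>A</abstract></article>",
    2: "<article><title>No abstract here</title></article>",
    3: "<book><abstract>Not an article</abstract></book>",
}
index = index_structure(docs)
print(sorted(index["<article>/<abstract>"]))  # [1]
```

"Find all articles that have an abstract" is then one lookup of the `<article>/<abstract>` term; document 3 has an abstract but is not an article, so it is not returned.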

3) XML Semantics

Find all documents that mention the company “Mark Logic”



<article>
  <title>MarkLogic Server: The Best Place for XML</title>
  <author><first-name>John</first-name><last-name>Kreisa</last-name></author>
  <abstract>
    Where should one put their XML? <company>Mark Logic</company> has the
    best answer to this question: MarkLogic Server. . . .
  </abstract>
  <body>
    <section>
      <section>This high performance engine can . . .</section>
    </section>
    <section>Using an inverted index technique . . .</section>
  </body>
  <copyright>Copyright© 2008 Mark Logic Corporation. All rights Reserved.</copyright>
</article>



3) XML Semantics


UNIVERSAL INDEX

Term                     Document References
“very”                   123, 127, 129, 152, 344, 791 . . .
“high”                   122, 125, 126, 129, 130, 167 . . .
“performance”            123, 126, 130, 142, 143, 167 . . .
“index”                  123, 130, 131, 135, 162, 177 . . .
“high performance”       126, 130, 167, 212, 219, 377 . . .
<article>                . . .
<article>/<abstract>     . . .
<company>Mark Logic</    . . .

Matching documents: 126, 130, 167, 212, 219, 377 . . .

Find all documents that mention the company “Mark Logic”


4) All Of The Above

Find all articles that contain the phrase “high performance” and
mention the company “Mark Logic” in the abstract



<article>
  <title>MarkLogic Server: The Best Place for XML</title>
  <author><first-name>John</first-name><last-name>Kreisa</last-name></author>
  <abstract>
    Where should one put their XML? <company>Mark Logic</company> has the
    best answer to this question: MarkLogic Server. . . .
  </abstract>
  <body>
    <section>
      <section>This high performance engine can . . .</section>
    </section>
    <section>Using an inverted index technique . . .</section>
  </body>
  <copyright>Copyright© 2008 Mark Logic Corporation. All rights Reserved.</copyright>
</article>



4) All Of The Above

UNIVERSAL INDEX

Term                     Document References
“very”                   123, 127, 129, 152, 344, 791 . . .
“high”                   122, 125, 126, 129, 130, 167 . . .
“performance”            123, 126, 130, 142, 143, 167 . . .
“index”                  123, 130, 131, 135, 162, 177 . . .
“high performance”       126, 130, 167, 212, 219, 377 . . .
<article>                . . .
<article>/<abstract>     . . .
<abstract>/<company>     . . .
<company>Mark Logic</    . . .

Matching documents: 126, 130, 167, 212, 219, 377 . . .

Find all articles that contain the phrase “high performance” and
mention the company “Mark Logic” in the abstract
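
One way to picture the combined query is as an intersection of term lists. The Python sketch below assumes each term list is a sorted list of document ids and intersects them with a linear merge, shortest list first. Only the "high performance" list comes from the slide; the other three lists are invented because the slide elides them, and the function names are hypothetical.

```python
def intersect(a, b):
    """Intersect two sorted lists of document ids with a linear merge."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def and_query(index, terms):
    """Resolve a conjunction by intersecting term lists, shortest first."""
    lists = sorted((index.get(term, []) for term in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


index = {
    '"high performance"': [126, 130, 167, 212, 219, 377],  # from the slide
    "<article>/<abstract>": [122, 126, 130, 167, 219],      # invented
    "<abstract>/<company>": [126, 130, 212, 219],           # invented
    "<company>Mark Logic</": [126, 219, 377],               # invented
}
print(and_query(index, list(index)))  # [126, 219]
```

In this sketch the intersection only shows that all four terms occur in the same document; it does not check where in the document they occur.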


Scalar Indexes

UNIVERSAL INDEX

Term                     Document References
“very”                   123, 127, 129, 152, 344, 791 . . .
“high”                   122, 125, 126, 129, 130, 167 . . .
“performance”            123, 126, 130, 142, 143, 167 . . .
“index”                  123, 130, 131, 135, 162, 177 . . .
“high performance”       126, 130, 167, 212, 219, 377 . . .
<article>                . . .
<article>/<abstract>     . . .
<abstract>/<company>     . . .
<company>Mark Logic</    . . .

Matching documents: 126, 130, 167, …

Identify a set of documents based on criteria and then characterize the
set with scalar indexes (float, dateTime, string etc.)
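
A minimal Python sketch of the idea, assuming a scalar index can be modelled as a sorted list of (value, document id) pairs for one typed field. The class name, its methods, and the publication-year data are invented for illustration and are not the server's API.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


class ScalarIndex:
    """Sorted (value, doc_id) pairs for one typed field."""

    def __init__(self, pairs):
        self.pairs = sorted(pairs)
        self.values = [value for value, _ in self.pairs]

    def range(self, low, high):
        """Document ids whose value lies in the closed range [low, high]."""
        start = bisect_left(self.values, low)
        end = bisect_right(self.values, high)
        return sorted(doc_id for _, doc_id in self.pairs[start:end])

    def characterize(self, doc_ids):
        """Count values over a document set selected by some other criteria."""
        wanted = set(doc_ids)
        return Counter(value for value, doc_id in self.pairs if doc_id in wanted)


# Hypothetical publication years keyed by the document ids from the slide.
year = ScalarIndex([(2006, 126), (2007, 130), (2007, 167), (2008, 212), (2005, 123)])
matches = [126, 130, 167]            # e.g. the result of a term-list query
print(year.characterize(matches))    # Counter({2007: 2, 2006: 1})
print(year.range(2007, 2008))        # [130, 167, 212]
```

The first call characterizes an already-identified set (here, counts per year); the second shows the same structure answering a range question directly.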


Geospatial, too

UNIVERSAL INDEX

Term                     Document References
“very”                   123, 127, 129, 152, 344, 791 . . .
“high”                   122, 125, 126, 129, 130, 167 . . .
“performance”            123, 126, 130, 142, 143, 167 . . .
“index”                  123, 130, 131, 135, 162, 177 . . .
“high performance”       126, 130, 167, 212, 219, 377 . . .
<article>                . . .
<article>/<abstract>     . . .
<abstract>/<company>     . . .
<company>Mark Logic</    . . .

Matching documents: 126, 130, 167, …

Just a special kind of scalar index, except values are points and scan
operators know about Earth geometry
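
As a sketch of what "values are points" could look like, the Python below stores (latitude, longitude) pairs with document ids and filters them by great-circle distance using the haversine formula. The spherical-Earth model, the function names, and the sample coordinates are assumptions made for this example; the slides do not say which geometry the server uses.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(point_index, center, radius_km):
    """Scan the point index for documents within radius_km of center."""
    return sorted(doc_id for point, doc_id in point_index
                  if haversine_km(point, center) <= radius_km)


# Hypothetical point values attached to document ids.
point_index = [
    ((37.77, -122.42), 126),  # roughly San Francisco
    ((37.34, -121.89), 130),  # roughly San Jose
    ((40.71, -74.01), 167),   # roughly New York
]
print(within_radius(point_index, (37.56, -122.32), 100))  # [126, 130]
```

The index layout is the same as for any other scalar; only the comparison used during the scan knows about the Earth.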


Universal Index Is Our Hammer


We turn queries into nails


Examples Of Nails

Directories

Exclusive, hierarchical, analogous to file system, map to URI

Collections

Set-based, N:N relationship

Security

Invisible to your app

Many Shapes And Sizes

News Article

Book

Research Report

Slide Presentation

Product Sheet

Operations Manual

Load As Is

XML is self-describing

<article>
  <title>MarkLogic Server: . . .</title>
  <author>
    <first-name>John</first-name>
    <last-name>Kreisa</last-name>
  </author>
  <abstract>
    . . . . <company>Mark Logic</company>
  </abstract>
  <body>
    <section>
      <section>. . .</section>
    </section>
    <section>. . . index . . .</section>
  </body>
  <copyright>Copyright© . . .</copyright>
</article>




Load As Is

<article>
  <title>MarkLogic Server: . . .</title>
  <author>
    <first-name>John</first-name>
    <last-name>Kreisa</last-name>
  </author>
  <abstract>
    . . . . <company>Mark Logic</company>
  </abstract>
  <body>
    <section>
      <section>. . .</section>
    </section>
    <section>. . . index . . .</section>
  </body>
  <copyright>Copyright© . . .</copyright>
</article>

XML is self-describing

[Figure: the document drawn as a tree of elements (<article>, <title>, <author>, <first-name>, <last-name>, <abstract>, <company>, <body>, three <section> elements, <copyright>) with text leaves "MarkLogic Server: . . .", "John", "Kreisa", "MarkLogic", ". . . index . . ."]

Load As Is

[Figure: element tree for the article (<article>, <title>, <abstract>, <body>, <copyright>, <author>, <first-name>, <last-name>, three <section> elements, <company>) with quoted text leaves "MarkLogic Server: . . .", "John", "Kreisa", "MarkLogic", ". . . index . . ."]

XML is self-describing


Load As Is

[Figure: element tree for the article (<article>, <title>, <abstract>, <body>, <copyright>, <author>, <first-name>, <last-name>, three <section> elements, <company>) with quoted text leaves "MarkLogic Server: . . .", "John", "Kreisa", "MarkLogic", ". . . index . . ."]

XML is self-describing

No Schema Needed!


Degrees Of Flexibility

[Figure: chart with two axes, Structure (ad hoc to predefined) and Queries (ad hoc to predefined), positioning IMS, IDMS, relational databases, search engines, and MarkLogic Server (XML server)]

The Query Language

XML

Universal

Index

XQuery


Full-Text Search


XML Structure

XML Semantics

Application Logic


Manipulate XML


Render Results

Load As Is

The Programming Language

XML

Universal

Index

XQuery


Full-Text Search


XML Structure

XML Semantics

Application Logic


Manipulate XML


Render Results

Load As Is

A Different Approach

Soul of a Search Engine: Data Model And Queries

Database: On-disk Organization And Transactions


What’s In A Database?

No tables

No rows

Forests . . .

. . . of trees

[Figure: a database made up of Forest 1, Forest 2, and Forest 3]

The Cluster

[Figure: a cluster with hosts e1, e2, ... ek and hosts d1, d2, d3, ... dl, holding forests 1, 2, 3, 4, ... m]

What About Updates?

Typical XML document: 10KB to 1MB

Referenced by 1,000s to 10,000s of term lists


Search engines are bad at updates

Many indexes to update

Option: Index and Information out of sync

Option: Slow


We want

High throughput

Transactions (ACID)

So how do we avoid updates?

Solution: Temporal Database

No update! No delete!


Only insert and read-at-a-time


Every document has two timestamps

“created”, “expired”
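
The two-timestamp scheme can be sketched in a few lines of Python. This is a toy model under stated assumptions: a single global counter stands in for the timestamp source, an update is modelled as "expire the live version, then insert a new one", and the class and method names are invented for the example.

```python
INFINITY = float("inf")


class TemporalStore:
    """Toy store in which every version carries created/expired timestamps."""

    def __init__(self):
        self.clock = 0
        self.versions = []  # each entry is [uri, content, created, expired]

    def _tick(self):
        self.clock += 1
        return self.clock

    def _expire(self, uri, now):
        # Old content is never overwritten; only its expiry is recorded.
        for version in self.versions:
            if version[0] == uri and version[3] == INFINITY:
                version[3] = now

    def insert(self, uri, content):
        """Create or replace: expire the live version, then append a new one."""
        now = self._tick()
        self._expire(uri, now)
        self.versions.append([uri, content, now, INFINITY])
        return now

    def delete(self, uri):
        """A delete is just an expiry with no new version."""
        now = self._tick()
        self._expire(uri, now)
        return now

    def read(self, uri, at):
        """Return the content visible at timestamp `at`, or None."""
        for u, content, created, expired in self.versions:
            if u == uri and created <= at < expired:
                return content
        return None


store = TemporalStore()
t1 = store.insert("a.xml", "<a>v1</a>")
t2 = store.insert("b.xml", "<b/>")
t3 = store.insert("a.xml", "<a>v2</a>")   # "update" of a.xml
t4 = store.delete("b.xml")
print(store.read("a.xml", t1))  # <a>v1</a>
print(store.read("a.xml", t3))  # <a>v2</a>
print(store.read("b.xml", t4))  # None
```

A reader that picks a timestamp sees one consistent set of versions, whatever writers do afterwards.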



Temporal Database


[Figure: timeline with timestamps 520 and 528 marked, showing the operations Create a.xml, Create b.xml, Update a.xml, Update a.xml, Delete b.xml, ... and two Query markers]

The Cluster

[Figure: a cluster with hosts e1, e2, ... ek and hosts d1, d2, d3, ... dl, holding forests 1, 2, 3, 4, ... m]

A Single Forest

[Figure: a host holding Forest k, which consists of Stand 1, Stand 2, ... Stand n and buffers]

1. Create A New Tree

[Figure: a host holding Forest k, which consists of Stand 1, Stand 2, ... Stand n and buffers]

2. Expire Trees

[Figure: a host holding Forest k, which consists of Stand 1, Stand 2, ... Stand n and buffers]

3. Save A Buffer To Disk

[Figure: a host holding Forest k, which consists of Stand 1, Stand 2, ... Stand n and buffers]

4. Optimization: Merge Stands

[Figure: a host holding Forest k and a buffer]

The Four Forest Operations

1. Create a new document

Into a buffer

2. Mark a document as expired

Memory-mapped document timestamps per stand

3. Write buffer out to disk

Our buffers are 100s of megabytes

For performance, double buffer

4. Merge


Background process


Optimization: reduces number of stands in forest
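
The four operations above can be sketched as a toy Python class. Several simplifications are assumptions of this example rather than statements from the slides: stands are plain dictionaries instead of on-disk structures, expiry timestamps live in one dictionary rather than per stand, the buffer is saved when it reaches a small fixed count, and the merge drops expired documents while combining stands. All names are hypothetical.

```python
class Forest:
    """Toy forest: one buffer plus a list of stands that are never edited."""

    def __init__(self, buffer_limit=2):
        self.buffer = {}      # doc id -> content, not yet saved
        self.stands = []      # saved stands, each a dict of doc id -> content
        self.expired = {}     # doc id -> timestamp at which it expired
        self.buffer_limit = buffer_limit
        self.next_id = 0

    def create(self, content):
        """1. Create a new document in the buffer."""
        doc_id = self.next_id
        self.next_id += 1
        self.buffer[doc_id] = content
        if len(self.buffer) >= self.buffer_limit:
            self.save_buffer()
        return doc_id

    def expire(self, doc_id, timestamp):
        """2. Mark a document as expired; its stand is not rewritten."""
        self.expired[doc_id] = timestamp

    def save_buffer(self):
        """3. Write the buffer out as a new stand and start a fresh buffer."""
        if self.buffer:
            self.stands.append(dict(self.buffer))
            self.buffer = {}

    def merge(self):
        """4. Merge all stands into one (assumption: expired docs are dropped)."""
        merged = {}
        for stand in self.stands:
            for doc_id, content in stand.items():
                if doc_id not in self.expired:
                    merged[doc_id] = content
        self.stands = [merged] if merged else []


forest = Forest(buffer_limit=2)
a = forest.create("<a/>")
b = forest.create("<b/>")        # buffer is full, so it is saved as a stand
c = forest.create("<c/>")
forest.expire(a, timestamp=5)
forest.save_buffer()
stands_before_merge = len(forest.stands)
forest.merge()
print(stands_before_merge, len(forest.stands))  # 2 1
print(forest.stands[0])                         # {1: '<b/>', 2: '<c/>'}
```

Nothing in a saved stand is ever modified in place: writes go to the buffer, expiry is recorded on the side, and the merge builds a new stand from the old ones.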






Consistency And Throughput

2-phase commit

Transactions span forests

Recovery

Forest Journals

Lock-free queries

Use the search engine at a point-in-time

Increased throughput

Time travel?


A Different Approach

Soul of a Search Engine: Data Model And Queries

Database: On-disk Organization And Transactions


Summary

XML as data model

Ad hoc schema


A search engine core

Universal Index


Temporal transaction model

High throughput while keeping . . .


Performance and scalability of a search engine

Mary Holstege

Principal Engineer

mary@marklogic.com

t: 650.655.2336

f: 650.655.2310

Thank You
