Inside the PostgreSQL Query Optimizer - Neil Conway

manlybluegooseΔιαχείριση Δεδομένων

27 Νοε 2012 (πριν από 4 χρόνια και 4 μήνες)

206 εμφανίσεις

Insidethe
PostgreSQLQueryOptimizer
NeilConway
neilc@samurai.com
FujitsuAustraliaSoftwareTechnology
PostgreSQLQueryOptimizerInternals–p.1
Outline
Introductiontoqueryoptimization
Outlineofqueryprocessing
Basicplanneralgorithm
PlanningspecificSQLconstructs
Questionswelcomethroughout
PostgreSQLQueryOptimizerInternals–p.2
Whatisqueryoptimization?
SQLisdeclarative;theuserspecifieswhatthequery
returns,nothowitshouldbeexecuted
TherearemanyequivalencesinSQL:
Joinscanbeappliedinanyorder
Predicatescanbeevaluatedinanyorder
Subselectscanbetransformedintojoins
Severaldifferentmethodsofdoingthesameoperation:
Threecorejoinalgorithms(nestedloops,hashjoin,
mergejoin)
Twoaggregationalgorithms(hashing,sorting)
Twoscanalgorithms(indexscan,sequentialscan)
Foranon-trivialquerytherearemanyalternativeplans
PostgreSQLQueryOptimizerInternals–p.3
Outlineofqueryprocessing
ClientconnectstopostmasterviaTCPorunixdomain
socket,communicatesviafrontend-backendprotocol
forknewbackendtohandleconnection
Authenticationinnewbackend
Entersimplequeryloop
Clientsubmitsquery
Backendexecutesquery,returnsresultset
PostgreSQLQueryOptimizerInternals–p.4
Queryloop
1.Lexandparse—flex,bison
Input:querystring
Output:“rawparsetree”
Nodatabaseaccessorsemanticanalysis
2.Analysis
Input:rawparsetree
Output:Query
Essentially,annotatedparsetree—dodatabase
lookupsformetadata
3.Rewriter
Input:Query
Output:OneormoreQuery
Applyrewriterules:CREATERULE,CREATEVIEW
PostgreSQLQueryOptimizerInternals–p.5
Queryloop,cont.
Alreadydone:weunderstandthesyntaxofthequery
andhavelookedupassociatedmetadataandapplied
somebasicsemanticchecks
Ifthisisa“utilitycommand”(CREATE,ALTER,DROP,
etc.),handofftotheimplementationofthecommand
Otherwise,remainingwork:
Decidehowtoevaluatethequery,produceaPlan
EvaluatethePlanandreturnresultsettoclient
Thequeryplanneriswhatdeterminesthebestwayto
evaluateaquery;alsoknownasthe“queryoptimizer”.
Thisrequires:
1.Determiningthesetofpossibleplans
2.Choosingthe“best”planfromthisset
PostgreSQLQueryOptimizerInternals–p.6
Representationofqueryplans
Werepresent“how”toexecuteaqueryasatreeofplan
nodes;eachnodeisasingleoperation(join,diskscan,
sort,etc.)
Tuplesflowfromtheleavesofthetree(diskscans)up
totheroot
Resultsdeliveredtoparentnode“ondemand”
Togetarow,anode“pulls”onitschildnode,which
inturnspullsonitschildnodesasneeded
Toproduceresultset,executorjustpullsonrootnode
Gutsofqueryexecutionisintheimplementationof
plannodes
Ingeneral,plannodesareasymmetric:leftandright
inputstreateddifferently
PostgreSQLQueryOptimizerInternals–p.7
Examplequery
DatabasestoresCVScommithistory
Acommitmodifiesnfiles;eachsuchmodificationisan
“action”
Query:findthetimestampofthelatestcommitto
modifygivenafilef
SELECTc.tstamp
FROMcommitsc,actionsa
WHEREa.fileIN
(SELECTidFROMfiles
WHEREpath=’...’)
ANDa.commit_id=c.id
ORDERBYc.tstampDESC
LIMIT1;
targetlist
rangetable
qualifier
IN-clausesubquery
joinpredicate
sortorder
limitexpression
PostgreSQLQueryOptimizerInternals–p.8
Examplequeryplan
LIMIT
SORT
key: commits.tstamp
JOIN
method: nested loops
key: action.commit_id = commit.id
JOIN
method: nested loops
key: file.id = action.file
AGGREGATE
method: hashing
key: files.id
SCAN files
method: index scan
key: path = ‘...’
SCAN actions
method: index scan
key: file = LEFT.id
SCAN commits
method: index scan
key: id = LEFT.commit_id
PostgreSQLQueryOptimizerInternals–p.9
Whatmakesagoodplan?
Theplannerchoosesbetweenplansbasedontheir
estimatedcost
Assumption:diskIOdominatesthecostofquery
processing.Therefore,picktheplanthatrequiresleast
diskIO
RandomIOis(much)moreexpensivethan
sequentialIOonmodernhardware
EstimateI/Orequiredbytryingtopredictthesizeof
intermediateresultsets,usingdatabasestatistics
gatheredbyANALYZE
Thisisanimperfectscience,atbest
Distinguishbetween“startupcost”(IOsrequiredforfirst
tuple)and“totalcost”
PostgreSQLQueryOptimizerInternals–p.10
Generaloptimizationprinciples
Thecostofanodeisafunctionofitsinput:thenumber
ofrowsproducedbychildnodesandthedistributionof
theirvalues.Therefore:
1.Reorderingplannodeschangeseverything
2.Apoorchoiceneartheleavesoftheplantreecould
spelldisaster
Keepthisinmindwhendebuggingpoorly
performingplans
3.Applypredicatesearly,soastoreducethesizeof
intermediateresultsets
Worthkeepingtrackofsortorder—givensortedinput,
certainplannodesarecheapertoexecute
Planningjoinseffectivelyisessential
PostgreSQLQueryOptimizerInternals–p.11
Planneralgorithm
Conceptually,threephases:
1.Enumeratealltheavailableplans
2.Assessthecostofeachplan
3.Choosethecheapestplan
Naturallythiswouldnotbeveryefficient
“SystemRalgorithm”iscommonlyused—adynamic
programmingalgorithminventedbyIBMinthe1970s
Basicidea:find“good”plansforasimplifiedquerywith
njoins.Tofindgoodplansforn+1joins,joineachplan
withanadditionalrelation.Repeat
PostgreSQLQueryOptimizerInternals–p.12
SystemRalgorithm
1.Considereachbaserelation.Considersequentialscan
andavailableindexscans,applyingpredicatesthat
involvethisbaserelation.Remember:
Cheapestunorderedplan
Cheapestplanforeachsortorder
2.Whilecandidateplanshavefewerjoinsthanrequired,
joineachcandidateplanwitharelationnotyetinthat
plan.Retain:
Cheapestunorderedplanforeachdistinctsetof
relations
Cheapestplanwithagivensortorderforeach
distinctsetofrelations
PostgreSQLQueryOptimizerInternals–p.13
SystemRalgorithm,cont.
Grouping(aggregation)andsortingisdoneattheend
Consider“left-deep”,“bushy”,and“right-deep”plans
(someplannersonlyconsiderleft-deepplans)
JOIN
JOIN
A
JOIN
B
C
D
JOIN
JOIN
JOIN
A
B
C
D
JOIN
A
JOIN
B
JOIN
C
D
Thenumberofplansconsideredexplodesasthe
numberofjoinsincreases;forquerieswithmanyjoins(≥12bydefault),ageneticalgorithmisused(“GEQO”)
Non-exhaustive,non-deterministicsearchof
possibleleft-deepjoinorders
PostgreSQLQueryOptimizerInternals–p.14
Planningouterjoins
Outerjoin:likeaninnerjoin,exceptincludeunmatched
jointuplesintheresultset
Innerjoinoperatorisbothcommutativeandassociative:
A✶B≡B✶A,A✶(B✶C)≡(A✶B)✶C
Ingeneral,outerjoinsareneitherassociativenor
commutative,sowecan’treorderthem
Maindifferenceisfeweroptionsforjoinorder;apairof
relationsspecifiedbyOUTERJOINiseffectivelya
singlebaserelationintheplanneralgorithm
Sometimesouterjoinscanbeconvertedtoinnerjoins:
SELECT
*
FROMaLEFTJOINbWHEREb.x=k
Tip:youcanforcejoinorderforinnerjoinsbyusing
JOINsyntaxwithjoin_collapse_limitsetto1
PostgreSQLQueryOptimizerInternals–p.15
Planningsubqueries
Threetypesofsubqueries:IN-clause,FROM-list,and
expression
Wealways“pullup”IN-clausesubqueriestobecomea
specialkindofjoinintheparentquery
WetrytopullupFROM-listsubqueriestobecomejoins
intheparentquery
Thiscanbedoneifthesubqueryissimple:no
GROUPBY,aggregates,HAVING,ORDERBY
Otherwise,evaluatesubqueryviaseparateplan
node(SubqueryScan)—akintoasequentialscan
PostgreSQLQueryOptimizerInternals–p.16
FROM-listsubqueryexample
SELECT
*
FROMt1,
(SELECT
*
FROMt2WHEREt2.x=10)t2
WHEREt1.id=t2.id;
--convertedbytheoptimizerinto
SELECT
*
FROMt1,t2
WHEREt1.id=t2.idandt2.x=10;
Subquerypullupallowstheplannertoreuseallthe
machineryforoptimizingjoins
Integratingsubqueryqualifiersintotheparentquery
canmeanwecanoptimizetheparentquerybetter
PostgreSQLQueryOptimizerInternals–p.17
Planningexpressionsubqueries
ProducenestedPlanbyrecursiveinvocationofplanner
An“uncorrelated”subquerydoesnotreferenceany
variablesfromitsparentquery;itwillthereforeremain
constantforagivendatabasesnapshot.
SELECTfooFROMbarWHEREbar.id=
(SELECTbaz.idFROMbaz
WHEREbaz.quux=100);
Ifuncorrelated,onlyneedtoevaluatethesubquery
onceperparentquery
$var=SELECTidFROMbazWHEREquux=100;
SELECTfooFROMbarWHEREid=$var;
Ifcorrelated,weneedtorepeatedlyevaluatethe
subqueryduringtheexecutionoftheparentquery
PostgreSQLQueryOptimizerInternals–p.18
Planningfunctions
Plannermostlytreatsfunctionsas“blackboxes”
Forexample,set-returningfunctionsintheFROMlist
arerepresentedasaseparateplannode(FunctionScan)
Can’teffectivelypredictthecostoffunction
evaluationorresultsizeofaset-returningfunction
Wecaninlineafunctioncallif:
DefinedinSQL
Usedinanexpressioncontext(notFROMlist—room
forimprovement)
Sufficientlysimple:“SELECT...”
Ifinvokedwithall-constantparametersandnotmarked
“volatile”,wecanpreevaluateafunctioncall
PostgreSQLQueryOptimizerInternals–p.19
Functioninliningexample
CREATEFUNCTIONmul(int,int)RETURNSintAS
‘SELECT$1
*
$2’LANGUAGEsql;
SELECT
*
FROMemp
WHEREmul(salary,age)>1000000;
--afterfunctioninlining,essentially
SELECT
*
FROMemp
WHERE(salary
*
age)>1000000;
Theinlinedformofthequeryallowstheoptimizerto
lookinsidethefunctiondefinitiontopredictthenumber
ofrowssatisfiedbythepredicate
Alsoavoidsfunctioncalloverhead,althoughthisis
smallanyway
PostgreSQLQueryOptimizerInternals–p.20
Planningsetoperations
Planningforsetoperationsissomewhatprimitive
Generateplansforchildqueries,thenaddanodeto
concatenatetheresultsetstogether
Somesetoperationsrequiremorework:
UNION:sortandremoveduplicates
EXCEPT[ALL],INTERSECT[ALL]:sort
andremoveduplicates,thenproduceresultsetviaa
linearscan
Notethatweneverconsideranyalternatives,so
planningisprettysimple(patcheswelcome)
PostgreSQLQueryOptimizerInternals–p.21
Potentialimprovements
Hard:
Databasestatisticsforcorrelationbetweencolumns
Functionoptimization
RewriteGEQO
Crazy:
Onlinestatisticsgathering
Executor→optimizeronlinefeedback
Parallelqueryprocessingonasinglemachine(one
queryonmultipleCPUsconcurrently)
Distributedqueryprocessing(overthenetwork)
PostgreSQLQueryOptimizerInternals–p.22
Questions?
Thankyou.
PostgreSQLQueryOptimizerInternals–p.23
UsingEXPLAIN
EXPLAINprintstheplanchosenforagivenquery,plus
theestimatedcostandresultsetsizeofeachplannode
Primaryplannerdebuggingtool;EXPLAINANALYZE
comparesplanner’sguessestoreality
Executesthequerywithper-plan-node
instrumentation
PostgreSQLQueryOptimizerInternals–p.24
EXPLAINoutput
EXPLAINANALYZESELECTc.tstampFROMcommitsc,actionsaWHEREa.fileIN
(SELECTidFROMfilesWHEREpath=‘...’)
ANDa.commit_id=c.idORDERBYc.tstampDESCLIMIT1;
Limit(cost=135.79..135.80rows=1width=8)
(actualtime=4.458..4.459rows=1loops=1)
->Sort(cost=135.79..135.84rows=20width=8)
(actualtime=4.455..4.455rows=1loops=1)
SortKey:c.tstamp
->NestedLoop(cost=5.91..135.36rows=20width=8)
(actualtime=0.101..4.047rows=178loops=1)
->NestedLoop(cost=5.91..74.84rows=20width=4)
(actualtime=0.078..0.938rows=178loops=1)
->HashAggregate(cost=5.91..5.91rows=1width=4)
(actualtime=0.050..0.052rows=1loops=1)
->IndexScanonfiles(cost=0.00..5.91rows=1width=4)
(actualtime=0.035..0.038rows=1loops=1)
->IndexScanonactionsa(cost=0.00..68.68rows=20width=8)
(actualtime=0.022..0.599rows=178loops=1)
->IndexScanoncommitsc(cost=0.00..3.01rows=1width=12)
(actualtime=0.012..0.013rows=1loops=178)
Totalruntime:4.666ms
PostgreSQLQueryOptimizerInternals–p.25