IBM Streams Processing Language Specification

Oct 31, 2013 (3 years and 11 months ago)

IBM InfoSphere Streams
Version 2.0.0.4
IBM Streams Processing Language
Specification
Note
Before using this information and the product it supports, read the general information under “Notices”.
Edition Notice
This document contains proprietary information of IBM. It is provided under a license agreement and is protected
by copyright law. The information contained in this publication does not include any product warranties, and any
statements provided in this manual should not be interpreted as such.
You can order IBM publications online or through your local IBM representative.
• To order publications online, go to the IBM Publications Center at
  www.ibm.com/e-business/linkweb/publications/servlet/pbi.wss
• To find your local IBM representative, go to the IBM Directory of Worldwide Contacts at
  www.ibm.com/planetwide
When you send information to IBM, you grant IBM a nonexclusive right to use or distribute the information in any
way it believes appropriate without incurring any obligation to you.
© Copyright IBM Corporation 2009, 2012.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract
with IBM Corp.
Summary of changes
This topic describes updates to this documentation for IBM® InfoSphere® Streams
Version 2.0 (all releases).
Note: The following revision characters are used in the InfoSphere Streams
documentation to indicate updates for Version 2.0.0.4:
• In PDF files, updates are indicated by a vertical bar (|) to the left of each
  new or changed line of text.
• In HTML files, updates are surrounded by double angle brackets
  (>> and <<).
Updates for Version 2.0.0.4 (Version 2.0, Fix Pack 4)
• Information about the timestamp and blob types is updated. For more
  information, see “Primitive Types”.
Updates for Version 2.0.0.3 (Version 2.0, Fix Pack 3)
• The following partition eviction policies are defined for partitioned windows:
  – partitionAge(float64 n)
  – partitionCount(uint32 c)
  – tupleCount(uint32 t)
  For more information about these policies, see “Partitioning”.
• A standard order is defined for the elements in the window clause. For more
  information, see “Window clause”.
Updates for Version 2.0.0.2 (Version 2.0, Fix Pack 2)
In the “Type conversions” topic, the reference to the library function for
flexible tuple conversions is corrected to assignFrom.
Updates for Version 2.0.0.1 (Version 2.0, Fix Pack 1)
This guide was not updated for Version 2.0.0.1.
Abstract
This document is the official specification for the IBM Streams Processing
Language (SPL), the programming language for IBM InfoSphere Streams. SPL is a
streaming language for general, scalable, continuous applications that can run on a
cluster. An SPL program describes a data flow graph, where the vertices are
operator instances and the directed edges are streams. SPL is extensible: users can
define their own primitive operators in a native language, and their own
parameterizable composite operators in SPL itself. Furthermore, SPL permits users
to dynamically compose independently launched streaming programs. Thanks to its
generality, SPL makes it easy to express applications from a large variety of
domains, including financial engineering and high-frequency trading,
transportation monitoring, manufacturing control, security, and healthcare.
Contents
Summary of changes
Abstract
Chapter 1. Introduction
  Language overview
  Summary of changes from SPADE to SPL
  Grammar notation
  Lexical syntax
Chapter 2. Types
  Primitive Types
  Composite types
  Collections
  Tuples
  Type definitions
  Value semantics
  Type conversions
Chapter 3. Expression language
  Expression operators
  Subscripts and slicing
  Mapped operators
  Runtime errors
  Statements
  Functions
  SPL functions
  Native functions
  Pass-by-reference
  Side-effects
Chapter 4. Operator invocations
  Operator invocation head
  Logic clause
  Port mutability
  Window clause
  Punctuation
  Tumbling windows
  Sliding windows
  Partitioning
  History
  Param clause
  Output clause
  Config clause
Chapter 5. Operator definitions
  Stream and operator instance names
  Composite operators
  Graph clause
  Config clause
  Operator parameters
  Operators as parameters to composite operators
  Other operator parameters
  Operator parameter passing semantics
  Operator parameter modes
  Primitive operators
  Runtime APIs
  Compile-time APIs
Chapter 6. Program structure
  Compile-time entities
  Compilation and deployment
  Dynamic application composition
  Embedded documentation
Chapter 7. VWAP example
Chapter 8. Grammar overview
Chapter 9. Reference
Notices
Index
Chapter 1. Introduction
This document specifies the Streams Processing Language (SPL), the language for
InfoSphere Streams, which is IBM's high-performance distributed stream
processing middleware [1]. First and foremost, SPL is a distributed data
flow composition language. An SPL program describes a directed graph, where
each edge is a stream, and each vertex is an operator invocation. A stream is an
infinite sequence of tuples, and each time an operator invocation receives a tuple
on one of its input streams, it fires, producing some number of tuples on its output
streams. SPL targets scalable data analytics applications and is suitable for many
domains. It achieves this goal by combining a simple and flexible flow language
with an extension mechanism for integrating native code as fast, flexible, and
reusable stream operators. This combination makes it possible to naturally express
algorithms from a variety of streaming languages that are less general and more
specialized to a particular domain [6].
Since high performance is a central motivation for distributed stream processing in
general and for SPL in particular, SPL is designed to give the programmer control
over the most performance-critical aspects of their code. In particular, the
programmer directly controls the graph topology and the data representations, and
can choose to control the threading model, placement, physical layout, and so on.
At the same time, SPL stays high-level by keeping the low-level details of operator
implementations in native code, while exposing enough information to the
compiler to enable optimizations. The combined goals of controlling data
representations for performance and of keeping code high-level and reusable are
addressed by SPL's strong static type system. To enable programmers who are
familiar with traditional languages to transition to stream programming, SPL
reuses concepts and syntax from traditional languages. In particular, SPL's type
system and expression sublanguage resemble those of C, Java, Python, and similar
languages. On the other hand, SPL's design firmly focuses on streaming. In
particular, the primary objective of an SPL operator is to process streams, and
streams transport pure data. To this end, SPL itself is a streaming language, not an
object-oriented language, although users can use SPL's extension features to employ
code written in an object-oriented language.
This document intentionally focuses only on the language, and omits information
specific to the compiler, libraries, or middleware. See the following companion
documents for documentation beyond the language specification:
• IBM Streams Processing Language Standard Toolkit Reference: A toolkit is a library
  that bundles up functions and operators. The standard library encompasses the
  toolkits that ship with the compiler.
• IBM Streams Processing Language Config Reference: A config is a clause in SPL that
  configures how an operator is implemented or deployed, and is usually
  associated with runtime and environment issues. For example, a placement
  config constrains the host an operator runs on. The IBM InfoSphere Streams Config
  Reference describes the configs supported by IBM Streams.
• IBM Streams Processing Language Operator Model Reference: An operator model is
  the set of properties that characterize a reusable operator. For example, the
  operator model for a Filter operator could specify that it takes an expression
  that evaluates to boolean for the filter condition. Operator models are written in
  XML, and the operator model reference describes the XML schema.
• IBM Streams Processing Language Toolkit Development Reference: This document is a
  guide to writing toolkits. It includes a description of how to write operators
  and functions in C++ or Java for use from SPL code.
This document is sprinkled with paragraphs containing auxiliary information:
• Practical advice: Best practices and conventions for users.
• Implementation note: Notes about how the compiler or runtime implements a
  feature.
• For SPADE users: Comparisons between SPL and its predecessor language
  SPADE.
• For language experts: Terminology from the programming language community.
In addition, there are numerous code examples. IBM's SPL compiler is
continuously tested on these examples. While some examples are semantically
incomplete (for example, using undefined identifiers), all examples are syntactically
valid.
Language overview
The following figure shows an example stream graph. The vertices of the stream
graph are operator invocations, and the edges are streams. An operator invocation
defines output streams by invoking a stream operator on input streams.

[Figure 1. Stream graph for the sale-join application example: two FileSource
operators produce the Bid and Ask streams, a Join operator combines them into the
Sale stream, and a FileSink operator consumes Sale.]

The following code shows how this stream graph could be written in SPL, with
one operator invocation per vertex.

    composite SaleJoin {  //1
      graph  //2
        stream<rstring buyer, rstring item, decimal64 price> Bid = FileSource() {  //3
          param file: "BidSource.dat"; format: csv;  //4
        }  //5
        stream<rstring seller, rstring item, decimal64 price> Ask = FileSource() {  //6
          param file: "AskSource.dat"; format: csv;  //7
        }  //8
        stream<rstring buyer, rstring seller, rstring item> Sale = Join(Bid; Ask) {  //9
          window Bid: sliding, time(30);  //10
                 Ask: sliding, count(50);  //11
          param match: Bid.item == Ask.item && Bid.price >= Ask.price;  //12
          output Sale: item = Bid.item;  //13
        }  //14
        () as Sink = FileSink(Sale) { param file: "Result.dat"; format: csv; }  //15
    }
An SPL program consists of one or more composite operators. A composite
operator defines a stream graph, which consists of operator invocations. There is a
one-to-one correspondence between the vertices in the figure and operator
invocations in the example code. The body of an operator invocation customizes
the way the operator works. For example, the window, param, and output clauses in
Lines 10-13 customize how the Join operator is invoked. For more information
about the Join operator and other operators, see IBM Streams Processing Language
Standard Toolkit Reference.
This language specification is written bottom-up, starting from types such as
rstring or int32 (see Chapter 2, “Types”). Values of these types
are manipulated by expressions, such as Bid.item == Ask.item (see
Chapter 3, “Expression language”). But the core concept of SPL is the
operator invocation, such as Sale = Join(Bid; Ask) (see Chapter 4,
“Operator invocations”). SPL allows users to define their own
primitive or composite operators, such as SaleJoin (see Chapter 5, “Operator
definitions”). Finally, an SPL program specifies functions, operators,
and types in namespaces (see Chapter 6, “Program structure”).
An extended example is listed in Chapter 7, “VWAP example”, and an overview of
the grammar that SPL uses is listed in Chapter 8, “Grammar overview”.
Summary of changes from SPADE to SPL
For SPADE users: The main influence for SPL was its predecessor language
SPADE. It is described in the IBM InfoSphere Streams product documentation and
at http://domino.research.ibm.com/comm/research_projects.nsf/pages/esps.spade.html.
There is also a research paper about SPADE in SIGMOD'08 [4]. The SPL
specification is self-contained, so it should be accessible to
people unfamiliar with SPADE. The design of SPL has the same spirit as SPADE;
the primary goal is permitting efficient implementation on distributed hardware.
Most changes aim to make the language simpler and more uniform. When given a
choice between code readability and writeability, SPL usually opts for readability.
SPL adds some features missing from SPADE, removes others, and changes many
features. Newly added features include composite operators and a richer data
model (tuple, map, decimal, timestamp, and so on), as well as many less visible
performance-oriented enhancements to the SPL runtime enabled by the visible
language changes. Removed features include bundles, bulk functions, and the
preprocessor. The most visible change in SPL is the syntax, which has become
more readable, especially for programmers familiar with C or Java. For example, in
SPADE, the operator invocation for Sale looks like this:

    stream Sale(buyer: String, seller: String, item: String)
      := Join(Bid <time(30)>; Ask <count(50)>)
      [ $1.item = $2.item & $1.price >= $2.price ]
      { item := $1.item }
Grammar notation
This language specification adopts a flavor of Backus-Naur Form (BNF) to describe
the syntax of language features, using the following conventions:

Table 1. Notations used in SPL

Notation             Meaning
Italics              Non-terminal.
ALL_CAPS_ITALICS     Token. For example, ID for identifiers.
'fixed-width font'   Verbatim text, quoted to avoid confusion with meta-characters.
                     For example: '('
(...)                Grouping, to disambiguate meta-syntax precedence.
...|...              Alternatives, to match the syntax on either the left or the right of the bar.
...?                 The syntax preceding this notation is optional.
...*                 The syntax preceding this notation is repeated zero or more times.
...+                 The syntax preceding this notation is repeated one or more times.
...*,                Comma-separated list of zero or more items.
...+,                Comma-separated list of one or more items.
...*;                Semicolon-separated list of zero or more items.
...+;                Semicolon-separated list of one or more items.
...+.                Period-separated list of one or more items.
nonTerminal ::= ...  Rule definition.
Lexical syntax
SPL files are written in Unicode using UTF-8 encoding. Most syntactic elements,
including keywords and identifiers, use the subset of Unicode that overlaps with
ASCII characters (the subset 32-126 of the Latin-1 alphabet ISO 8859-1). Restricting
identifiers to ASCII facilitates interoperability with native languages like C++ and
with file systems. The only two constructs where other Unicode characters are
valid are Unicode string literals and comments. Identifiers start with an ASCII
letter or underscore, followed by ASCII letters, digits, or underscores. Identifiers of
formal parameters of composite operators start with a dollar sign ($). The syntax for
literal values of primitive types (numbers, booleans, strings, and other types) is
similar to that of Java or C++, and is described in the topic “Primitive Types”.
SPL has two forms of comments: single-line comments (from // to the end of the
line) and delimited comments (between /* and */). Delimited comments can span
multiple lines, and can be followed by regular code in the same line.
SPL syntax is not sensitive to indentation or line breaks. SPL syntax is
case-sensitive. For example, mud and Mud are different identifiers.
An SPL keyword, such as if or stream, cannot be used as an identifier. SPL uses
lexical scoping for identifiers. In other words, a declaration in an inner scope
shadows declarations of the same identifier in statically enclosing scopes. SPL
allows textual uses of function and stream names to precede their definition, in
order to support recursive functions and cyclic stream graphs. SPL does not permit
entities of different categories to share a name in the same scope. For example, the
code f: for(;;) { f f = new f().f(); } is valid in Java, even though the same
identifier refers to a label, a type, a variable, a constructor, and a method. In SPL,
on the other hand, it is a compile-time error if a program declares both an operator
named f and a function named f in the same scope. As another example, it is a
compile-time error if a program declares both a type named t and a variable
named t in the same scope. In other words, SPL does not segregate scopes by
identifier categories, unlike, for example, Java. That also means that identifiers in
inner scopes shadow identifiers in outer scopes even when they are of a different
category. For example, a locally declared stream s hides any function s in an outer
scope. As we will see later, namespaces and composite operators provide not only
a scope but also support disambiguation with qualified names, and data hiding
with private members.
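The identifier rules above can be sketched as a pair of regular expressions. This is
an illustrative Python model, not SPL code: the names is_identifier and
is_formal_parameter are hypothetical, the keyword set lists only the two keywords
named in this section, and the sketch assumes that the part of a formal-parameter
name after the dollar sign follows the ordinary identifier rule.

```python
import re

# Ordinary identifiers: an ASCII letter or underscore, followed by
# ASCII letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Formal parameters of composite operators start with a dollar sign.
# Assumption: the rest of the name is an ordinary identifier.
FORMAL_PARAMETER = re.compile(r"\$[A-Za-z_][A-Za-z0-9_]*")

# Only the two keywords mentioned in the text; the real keyword list is longer.
SAMPLE_KEYWORDS = {"if", "stream"}


def is_identifier(text: str) -> bool:
    """True if text is shaped like an identifier and is not a sample keyword."""
    return IDENTIFIER.fullmatch(text) is not None and text not in SAMPLE_KEYWORDS


def is_formal_parameter(text: str) -> bool:
    """True if text is shaped like a composite-operator formal parameter."""
    return FORMAL_PARAMETER.fullmatch(text) is not None
```

Because the character classes are ASCII-only, a name such as piñata is rejected,
which matches the restriction of identifiers to ASCII.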
For SPADE users: In SPADE, single-line comments started with # and delimited
comments were surrounded by #* ... *#.
Chapter 2. Types
SPL has several primitive types tailored for streaming, and a few composite types
inspired by scripting languages but statically checked. The figure in the topic
“Primitive Types” shows all types arranged in a hierarchy. This section describes
the types, how to define them, and how to convert values between different types.
Primitive Types
A primitive type, such as int32 or rstring, is one that is not composed of other
types. By supporting many primitive types, SPL gives the user fine control over
data representation, which is crucial for performance in high-volume streams. Tight
representation is important both to keep the data on the wire small, and to reduce
serialization and deserialization overheads. SPL supports the following primitive
types:
Table 2. Primitive types in SPL

Type         Representation
boolean      true or false
enum         User-defined enumeration of identifiers
intb         Signed b-bit integer. The signed integer types are:
               int8        -128 to 127
               int16       -32,768 to 32,767
               int32       -2,147,483,648 to 2,147,483,647
               int64       -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
uintb        Unsigned b-bit integer. The unsigned integer types are:
               uint8       0 to 255
               uint16      0 to 65,535
               uint32      0 to 4,294,967,295
               uint64      0 to 18,446,744,073,709,551,615
floatb       IEEE 754 binary b-bit floating point number. The binary floating types are:
               float32     Single-precision, equivalent to float in Java
               float64     Double-precision, equivalent to double in Java
decimalb     IEEE 754 decimal b-bit floating point number. The decimal floating types are:
               decimal32   Significand 7 decimal digits, exponents 10^-95 to 10^96
               decimal64   Significand 16 decimal digits, exponents 10^-383 to 10^384
               decimal128  Significand 34 decimal digits, exponents 10^-6,143 to 10^6,144
complexb     2b-bit complex number. The complex types are:
               complex32   Both real and imaginary parts are float32
               complex64   Both real and imaginary parts are float64
timestamp    Point in time, with nanosecond precision
rstring      String of raw 8-bit characters
ustring      String of UTF-16 Unicode characters, based on ICU library
blob         Sequence of raw bytes
rstring[n]   Bounded-length string of at most n raw 8-bit characters
The names of numeric types include their bit-width to make the naming consistent
and to avoid unwieldy names such as “long long unsigned int”. Users can also
define their own type names if desired (see topic “Type definitions”).
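The integer ranges listed in Table 2 follow directly from the bit width in each type
name. The following Python sketch recomputes them; the helper names signed_range
and unsigned_range are hypothetical and are not SPL functions.

```python
def signed_range(bits: int) -> tuple[int, int]:
    """Smallest and largest value of a signed integer with the given bit width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def unsigned_range(bits: int) -> tuple[int, int]:
    """Smallest and largest value of an unsigned integer with the given bit width."""
    return 0, (1 << bits) - 1


for bits in (8, 16, 32, 64):
    low, high = signed_range(bits)
    print(f"int{bits}: {low} to {high}; uint{bits}: 0 to {unsigned_range(bits)[1]}")
```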
An example of an enumeration is:

    type LogLevel = enum { error, info, debug, trace };

Any of the identifiers error, ..., trace can be used where a value of enumeration
LogLevel is expected. The scope of the identifiers error, ..., trace is the same as
the scope containing the type definition. Enumerations are ordered (they permit
comparison with <, >, <=, and >=) but not numeric (they do not permit arithmetic
with +, -, *, and so on).
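The "ordered but not numeric" behavior can be modeled in Python as follows. This is
an analogy rather than SPL code: the OrderedEnum base class is a hypothetical helper,
and the sketch assumes that the order of the enumeration values follows their
declaration order, which this section does not state explicitly.

```python
from enum import Enum


class OrderedEnum(Enum):
    """Enum whose members compare by value but support no arithmetic."""

    def __lt__(self, other):
        if self.__class__ is other.__class__:
            return self.value < other.value
        return NotImplemented

    def __le__(self, other):
        if self.__class__ is other.__class__:
            return self.value <= other.value
        return NotImplemented

    def __gt__(self, other):
        if self.__class__ is other.__class__:
            return self.value > other.value
        return NotImplemented

    def __ge__(self, other):
        if self.__class__ is other.__class__:
            return self.value >= other.value
        return NotImplemented


class LogLevel(OrderedEnum):
    # Assumption: values are numbered in declaration order.
    error = 0
    info = 1
    debug = 2
    trace = 3


def addition_is_rejected() -> bool:
    """Return True if adding two enumeration values raises TypeError."""
    try:
        LogLevel.error + LogLevel.info
    except TypeError:
        return True
    return False
```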
As in C and Java, literals for int, uint, float, and decimal can have optional type
suffixes. For example, 123 is signed (int32) whereas 123u is unsigned (uint32). One
suffix indicates the kind of number.

Suffix   Meaning
s        Signed integer (default for integer literals)
u        Unsigned integer
f        Binary floating-point (default for floating point literals)
d        Decimal floating-point

Another suffix indicates the number of bits.

Suffix          Meaning
b (byte)        8-bit
h (halfword)    16-bit
w (word)        32-bit (default for integer literals)
l (long)        64-bit (default for floating point literals)
q (quad-word)   128-bit
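[Figure 2. Hierarchy of SPL types. The root (any) splits into (primitive) and
(composite). The primitive types are boolean, enum, (numeric), timestamp, (string),
and blob. (numeric) comprises (integral), (floatingpoint), and (complex): (integral)
covers the (signed) types int8, int16, int32, int64 and the (unsigned) types uint8,
uint16, uint32, uint64; (floatingpoint) covers the (float) types float32, float64 and
the (decimal) types decimal32, decimal64, decimal128; (complex) covers complex32 and
complex64. (string) comprises rstring and ustring. The composite types are
(collection), which comprises list, set, and map, and tuple.]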
Some more examples of literals with a type suffix: 0.0005 (float64), 0.5e-3
(float64), 3.5d (decimal64), 3.5w (float32), 123d (decimal64), 123dq (decimal128).
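The suffix rules and defaults can be summarized in a small Python model. This is a
sketch, not the SPL lexer: literal_type is a hypothetical name, the regular
expression covers only simple decimal literals, it assumes that the kind suffix
precedes the width suffix (as in 123dq), and it assumes that decimal literals
default to 64 bits, as the examples 3.5d and 123d suggest. It does not reject
combinations that name no SPL type.

```python
import re

# digits, optional fraction, optional exponent, optional kind suffix, optional width suffix
LITERAL = re.compile(r"(\d+)(\.\d+)?([eE][+-]?\d+)?([sufd]?)([bhwlq]?)")

KIND = {"s": "int", "u": "uint", "f": "float", "d": "decimal"}
WIDTH = {"b": 8, "h": 16, "w": 32, "l": 64, "q": 128}


def literal_type(text: str) -> str:
    """Return the type name implied by a decimal numeric literal and its suffixes."""
    match = LITERAL.fullmatch(text)
    if match is None:
        raise ValueError(f"not a decimal numeric literal: {text!r}")
    _, fraction, exponent, kind_suffix, width_suffix = match.groups()

    if kind_suffix:
        kind = KIND[kind_suffix]
    else:
        # Without a kind suffix, a fraction or exponent makes the literal binary floating-point.
        kind = "float" if (fraction or exponent) else "int"

    if width_suffix:
        bits = WIDTH[width_suffix]
    else:
        # Defaults: 32 bits for integers, 64 bits for floating-point literals.
        bits = 32 if kind in ("int", "uint") else 64

    return f"{kind}{bits}"
```

Under these assumptions the model reproduces every example given in this section,
for instance literal_type("3.5w") yields float32.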
SPL supports hexadecimal literals. One can specify a hexadecimal literal with a 0x
prefix. Valid suffixes for hexadecimal literals are s (signed integer) and u (unsigned
integer). By default a hexadecimal literal is a signed integer. Its data length is
determined by the number of hexadecimal digits specified. The data length
includes any leading zeros. A maximum of 16 hexadecimal digits are supported
(int64, uint64). Specifying more than 16 hexadecimal digits will result in an error.
Some examples of hexadecimal literals: 0xf (int8), 0x00fu (uint16), 0x12345
(int32), -0x12345s (int32), 0x0123456789ABCDEF (int64).
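The digit-count rule for hexadecimal literals can be sketched in Python. The
function name hex_literal_type is hypothetical, and the rounding of the digit count
up to the next of 8, 16, 32, or 64 bits is inferred from the examples above rather
than stated as a rule in this section.

```python
import re

# optional sign, 0x prefix, hexadecimal digits, optional s/u suffix
HEX_LITERAL = re.compile(r"(-?)0x([0-9A-Fa-f]+)([su]?)")


def hex_literal_type(text: str) -> str:
    """Return the integer type implied by a hexadecimal literal."""
    match = HEX_LITERAL.fullmatch(text)
    if match is None:
        raise ValueError(f"not a hexadecimal literal: {text!r}")
    _, digits, suffix = match.groups()
    if len(digits) > 16:
        raise ValueError("more than 16 hexadecimal digits")
    # Each hexadecimal digit contributes 4 bits; leading zeros count.
    # Assumption: the width is the smallest of 8/16/32/64 that fits all digits.
    bits = next(width for width in (8, 16, 32, 64) if len(digits) * 4 <= width)
    return ("uint" if suffix == "u" else "int") + str(bits)
```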
String literals are written in double quotes. SPL supports two kinds of strings,
Unicode and raw. While ustring contains Unicode characters encoded as
UTF-16, rstring contains raw 8-bit characters. This allows the developer to pick
Unicode when international character sets are important, and to pick raw strings
when constant-time random access and a tight representation are important. A type
suffix in the literals indicates the string kind: r indicates rstring and u indicates
ustring, the default (without suffix) being rstring. String literals can use escape
sequences of the form \uhhhh, where the four hexadecimal digits hhhh specify a
character. For example, "pi\u00f1ata"u uses the escape \u00f1 to specify the
character ñ (n with a tilde) in a ustring. Other string escape sequences are similar
to those in C or Java.
Table 3. String escape characters

Escape   Meaning
\a       Alert
\b       Backspace
\f       Form feed
\n       Newline
\r       Carriage return
\t       Horizontal tab
\v       Vertical tab
\'       Single quote
\"       Double quote
\?       Question mark
\0       Null character
\\       Literal backslash
Recall from topic “Lexical syntax” that SPL files are written in UTF-8, so
letters such as ñ can also appear directly in a string literal, without the escape
sequence. Both ustring and rstring may contain internal null characters, which,
unlike in C, are not considered terminating. In other words, characters whose
encoding is zero carry no special meaning, and the length of a string is
independent of whether or not it contains such characters.
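As a rough analogy, the difference between the two string kinds and the treatment of
null characters can be observed in Python, using bytes as a stand-in for rstring and
str as a stand-in for ustring. This is an assumption-laden sketch: Python's str
counts code points rather than UTF-16 code units (the two agree for characters such
as ñ), and the sketch assumes that a raw string holds the UTF-8 bytes of the source
text.

```python
unicode_text = "pi\u00f1ata"             # stand-in for a ustring
raw_text = unicode_text.encode("utf-8")  # stand-in for an rstring holding UTF-8 bytes

unicode_length = len(unicode_text)       # counts characters
raw_length = len(raw_text)               # counts 8-bit units; ñ occupies two of them

# A null character in the middle does not terminate the string or shorten its length.
with_null = b"ab\x00cd"
with_null_length = len(with_null)
```

Here the Unicode text has 6 characters while its raw form has 7 bytes, and the
string with an embedded null has length 5.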
Literals for complex numbers are written as a list literal with a cast, as in
(complex32)[1.0, 2.0].
The timestamp type is designed to allow a high degree of precision and to
avoid overflow conditions (it can represent values ranging over billions of years)
by following widely accepted standards. It uses a 128-bit representation, where
64 bits store the seconds since the epoch as a signed integer, 32 bits store the
nanoseconds, and 32 bits store an optional identifier of the machine where the
measurement was taken, which can be useful for after-the-fact drift compensation.
The epoch, time zone, and similar details depend on the library functions used to
manipulate timestamps; for more information about these functions, see the API
documentation bundled with the product release. A timestamp can be initialized
using one of the SPL functions or from a float64. There is no literal for a
timestamp.
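The three fields of the 128-bit representation can be modeled in Python as follows.
This is only a sketch: the class and method names are hypothetical, the field order
and big-endian packing are arbitrary choices (the actual layout is implementation
dependent), the signedness of the machine identifier is assumed, and from_float
assumes that the float64 value holds seconds since the epoch.

```python
import math
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Timestamp:
    seconds: int         # signed 64-bit count of seconds since the epoch
    nanoseconds: int     # 32-bit nanosecond part, 0 to 999,999,999
    machine_id: int = 0  # optional 32-bit machine identifier

    @classmethod
    def from_float(cls, value: float) -> "Timestamp":
        """Assumed conversion: value is a number of seconds since the epoch."""
        seconds = math.floor(value)
        nanoseconds = round((value - seconds) * 1e9)
        if nanoseconds == 1_000_000_000:  # rounding carried into the next second
            seconds += 1
            nanoseconds = 0
        return cls(seconds, nanoseconds)

    def pack(self) -> bytes:
        """Pack into 16 bytes: 8 for seconds, 4 for nanoseconds, 4 for the machine id."""
        return struct.pack(">qIi", self.seconds, self.nanoseconds, self.machine_id)
```

For example, Timestamp.from_float(1.5) yields one second and 500,000,000 nanoseconds,
and every packed value occupies exactly 16 bytes (128 bits).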
Many operators and functions are overloaded to work with different types. For
example, the operator + can add various types of numbers, but it can also
concatenate strings. Likewise, the function length(x) is overloaded to accept x of
type rstring or ustring.
In order to permit efficient marshaling and unmarshaling of network packets, SPL
offers bounded-size variants of the rstring, list, set, and map types. For example,
rstring[5] can store any rstring of at most 5 characters, and each character in an
rstring takes 1 byte. If all parts of a data value have a fixed size, then parts can be
found at a fixed offset without decoding the whole. The compiler prohibits implicit
conversions from unbounded to bounded types, but the user can override that by
explicit casts. Type bounds, whether in variable declarations or in casts, must be
compile-time constants. A cast from any string to a bounded string truncates the
value if it is too long. We considered using slicing instead of casts to convert
unbounded to bounded values, but decided against it because unlike type bounds,
slicing parameters are not necessarily compile-time constants, and slicing does not
alter the type. SPL does not offer bounded ustring values, because bounding the
number of Unicode characters would not achieve the goal of fixing the size of the
byte representation. SPL limits all strings, bounded or unbounded, raw or Unicode,
to at most 2^31 - 1 characters.
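The truncating cast, and the reason a character bound would not fix the byte size of
a Unicode string, can both be illustrated in Python. The function names are
hypothetical, and bytes again stands in for rstring; this models the described
behavior and is not the SPL implementation.

```python
def cast_to_bounded_rstring(value: bytes, bound: int) -> bytes:
    """Model of a cast to rstring[bound]: keep at most `bound` 8-bit characters."""
    if bound < 0:
        raise ValueError("bound must be non-negative")
    return value[:bound]


def utf16_size(text: str) -> int:
    """Number of bytes needed to store text as UTF-16 (without a byte-order mark)."""
    return len(text.encode("utf-16-le"))


# Two strings of five Unicode characters each can need different numbers of bytes,
# because a character outside the Basic Multilingual Plane takes two 16-bit units.
plain_size = utf16_size("hello")
emoji_size = utf16_size("hell\U0001F600")
```

Casting b"streaming" to a bound of 5 yields b"strea", while the two five-character
strings occupy 10 and 12 bytes respectively.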
Blobs are sequences of at most 2^63 - 1 raw bytes. A blob can be initialized from a
list<uint8>. There is no literal for a blob.
Implementation note: The language specification purposely does not specify a byte
order, because these details are not visible to users within an SPL application.
Compilers are expected to provide flags to choose, for example, network byte order
or native byte order. The internal representation of both bounded and unbounded
strings and blobs stores a separate length field. As with all types, the exact layout
is implementation dependent and not exposed at the language level. Typically,
bounded types would be padded in case the length is lower than the bound, thus
allowing subsequent attributes to be stored at a fixed offset in a network packet.
The implementation may also reduce the number of bits in the length field
according to the bound to save space. The maximum tuple size when serializing a
tuple's contents for network transmission is 2^32 - 2 bytes. At the C++ level, an
SPLSerializationException will be thrown if the maximum tuple size limit is
exceeded. When implementing primitive operators, a developer might elect to
catch and handle this exception, avoiding sudden termination of the operator at
runtime.
Practical advice: Use the decimal floating point types in financial, commercial, and
user-centric programs to avoid losing decimal digits to binary rounding. For
unstructured data of bounded size, use a list<uint8>[n] instead of a blob. For
more information, see “Composite types”.
For SPADE users: Several primitive types are new in SPL, for example uint,
ustring, decimal, and timestamp.
Composite types
A composite type is the result of applying a type constructor to one or more type
arguments. For example, the composite type list<int32> is the result of applying
the type constructor list<T> to the type argument T=int32.
For SPADE users: The only composite types in SPADE were matrix and list; SPL
drops matrix, but introduces additional composite types tuple, set, and map. Also
in SPADE, a list could only contain primitive values; in SPL, composite types nest
arbitrarily. SPL drops matrix types, because they were mainly used for emulating
nested lists, and SPL supports that directly. Code that really manipulates matrices
as opposed to nested lists is more naturally written in a native language using
SPL's extension mechanisms.
Collections
SPL has three built-in collections (list, set, and map) with the following type
constructors:

Table 4. Constructors of built-in collections

Constructor    Collection
list<T>        list (random-access zero-indexed sequence)
set<T>         set (unordered collection without duplicates)
map<K,V>       map (unordered mapping from key type K to value type V)
list<T>[n]     like list<T>, but bounded-length
set<T>[n]      like set<T>, but bounded-length
map<K,V>[n]    like map<K,V>, but bounded-length

Collections in SPL have bounded-length variants that follow the same rules as
bounded strings. For example, list<int32>[2] can store a maximum of two
integers. You can declare bounded-size types with unbounded elements, such as
list<rstring>[3], though doing so does not offer the same marshaling
optimization opportunities. SPL limits all bounded or unbounded collections to a
maximum of 2^31 - 1 elements. For types with bounded size, the bounds must be
compile-time constants.
The literal syntax for list values and map values is the same as in Python or
JavaScript (including JSON), and the literal syntax for set values is similar to
Python or JavaScript.

    listLiteral ::= '[' expr*, ']'                 # e.g. [0, x, x+1, x+2]
    mapLiteral  ::= '{' ( expr ':' expr )*, '}'    # e.g. {"Mon": -1, "Fri": 1}
    setLiteral  ::= '{' expr*, '}'                 # e.g. {"IBM", "GOOG"}
List, set, and map types can be arbitrarily nested. In other words, their elements,
and in the case of maps even their keys, can be of any type, including other
composite types. Empty list, set, and map literals can only occur where their type
is clear from the context: casts, variable initializers, assignments, and typed
operator parameters.
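Since the list, map, and set literal forms are shared with Python, the grammar
examples above can be tried there directly. The snippet below is a Python
illustration, not SPL, and the variable names are placeholders. It also hints at why
an empty literal needs context: in Python, the empty braces are resolved by a fixed
rule (they denote a dictionary), whereas SPL requires the surrounding context to
determine the type.

```python
x = 5

as_list = [0, x, x + 1, x + 2]   # list literal
as_map = {"Mon": -1, "Fri": 1}   # map literal (a dict in Python)
as_set = {"IBM", "GOOG"}         # set literal

# Empty braces are ambiguous between a map and a set; Python picks dict by convention.
empty_braces = {}
empty_set = set()
```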
As we will see in Chapter 3, “Expression language”,
composite types have their own operators. You can subscript a list or map with
l[i], check membership in a collection with "x in s", iterate over a collection with
"for(T x in s) ...", and so on. There are also various functions to work on
collections; for example, setUnion(set1, set2) or setIntersection(set1, set2).
For more information, see Chapter 7, “VWAP example”. For lists and
maps, the left operand of the in operator is the key. In other words, "Oz" in m
checks whether "Oz" is a key of the map m. Value (as opposed to key) membership
tests use functions, not the in operator.
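Python's in operator treats maps the same way, so the key-versus-value distinction
can be illustrated there. This is a Python analogy with made-up data; the set
operators | and & stand in for the SPL functions setUnion and setIntersection, and
the values() call stands in for whichever function a program would use for value
membership.

```python
m = {"Oz": "Emerald City", "Narnia": "Cair Paravel"}

key_found = "Oz" in m                              # True: "Oz" is a key
value_found_by_in = "Emerald City" in m            # False: in does not look at values
value_found_by_function = "Emerald City" in m.values()

set1 = {"IBM", "GOOG"}
set2 = {"GOOG", "MSFT"}
union = set1 | set2          # analogous to setUnion(set1, set2)
intersection = set1 & set2   # analogous to setIntersection(set1, set2)
```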
Implementation note: Lists are implemented via arrays, but unlike the C language,
SPL uses static or dynamic checks to protect against out-of-bounds accesses. Sets
and maps are implemented via hash tables. They are unordered and support
constant-time lookup.
Tuples
A tuple is a sequence of attributes, and an attribute is a named value. For example,
the tuple { sym="Fe", no=26 } consists of two attributes sym="Fe" and no=26. The
type for this tuple is tuple<rstring sym, int32 no>. Tuples are similar to database
rows in that they have a sequence of named and typed attributes. Tuples differ
from objects in Java or C++ in that they do not have methods. You can access the
attributes of a tuple with dot notation. For example, given a tuple t of type
tuple<rstring sym, int32 no>, the expression t.sym yields the sym attribute.
Attributes within a tuple are ordered: tuple<int32 x, int32 y> and tuple<int32
y, int32 x> are two different types. But this ordering is hidden at the language
level, and may be changed by the runtime implementation for optimization. The
only way a program can depend on the declaration order of attributes is in source
operators that read from a text file formatted as comma-separated values (csv).
Tuple types can extend other tuple types by adding more attributes. For example:
  type Loc2d = tuple<int32 x, int32 y>;
  type Loc3d = Loc2d, tuple<int32 z>;
The resulting type Loc3d is equivalent to tuple<int32 x, int32 y, int32 z>. In
tuple extension, if the tuples have an attribute with the same name and type, the
resulting tuple contains that attribute just once, in the position of its left-most
occurrence. It is an error for the tuples in tuple extension to have attributes with
the same name but different types.
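Illustration (not SPL): the extension rule can be modeled outside the language. The following Python sketch is only an approximation under stated assumptions: a tuple type is represented as an ordered list of (attribute name, type name) pairs, and the helper name extend_tuple_types is hypothetical, not part of any SPL library.

```python
def extend_tuple_types(*tuple_types):
    """Model SPL tuple extension over lists of (attribute name, type name) pairs."""
    merged = []   # attributes in the order of their left-most occurrence
    seen = {}     # attribute name -> type name recorded so far
    for attrs in tuple_types:
        for name, type_name in attrs:
            if name not in seen:
                seen[name] = type_name
                merged.append((name, type_name))
            elif seen[name] != type_name:
                # Same name but different types is an error in tuple extension.
                raise TypeError(f"attribute {name!r} has conflicting types")
            # Same name and same type: the left-most occurrence is kept.
    return merged


loc2d = [("x", "int32"), ("y", "int32")]
loc3d = extend_tuple_types(loc2d, [("z", "int32")])
```

Here loc3d mirrors tuple<int32 x, int32 y, int32 z>, and extending loc2d with a conflicting attribute such as ("x", "rstring") raises an error, mirroring the compile-time error in SPL.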
In tuple extension, the identifier may also be the name of a stream as a shorthand
for the type of the tuples in that stream. For example:
  composite Main {
    type LocWithId = LocStream, tuple<rstring id>;
    graph stream<int32 x, int32 y> LocStream = Beacon() {}
  }
The resulting type LocWithId is equivalent to tuple<int32 x, int32 y, rstring
id>. The syntax for tuple types is:
  tupleType     ::= 'tuple' '<' tupleBody '>'
  tupleBody     ::= attributeDecl+,            # attributes
                  | ( ID | tupleType )+,       # tuple type extension
  attributeDecl ::= type ID
In other words, the body of the tuple type either consists of a comma-separated list
of attribute declarations, or a comma-separated list of tuple types, either by name
or written in-place. This extension syntax deviates from the inheritance syntax in
object-oriented languages like Java because it needs to be light-weight and
nestable. Tuple types can be arbitrarily nested. In other words, their attributes can
be of any type, including other composite types, even other tuple types. For
example:
  type Loc2d = tuple<int32 x, int32 y>;
  type Sensor = tuple<rstring color, tuple<Loc2d, tuple<int32 z>> loc>;
After extending the type Loc2d with a z attribute, and using the whole thing as a
loc attribute, the resulting type Sensor is tuple<rstring color, tuple<int32 x,
int32 y, int32 z> loc>. For example, given a variable Sensor s, the expression
s.loc.z accesses the z-coordinate.
As we will see in the topic “Type definitions,” the explicit tuple<...> type
constructor can be omitted at the top level of a type definition. That means that the
previous example can also be written like this:
  type Loc2d = int32 x, int32 y;
  type Sensor = rstring color, tuple<Loc2d, tuple<int32 z>> loc;
The literal syntax for tuple values is as follows:
  tupleLiteral ::= '{' ( ID '=' expr )*, '}'    # e.g. { x=1, y=2 }
The empty tuple literal, {}, is only valid in casts or initialization. For example,
mutable Loc3d myLocation = {}; zero-initializes myLocation, equivalent to mutable
Loc3d myLocation = {x=0, y=0, z=0};.
Type definitions
A type definition gives a name to a type. For example:
  composite Main {
    type Integers = list<int32>;
         MySchema = rstring s, Integers ls;
    graph /* ... */
      stream<MySchema, tuple<int32 id>> SenSource = FileSource() {
        param file: "SenSource.dat"; format: csv;
      }
  }
The type MySchema is defined as a tuple, leaving out the optional constructor
tuple<...>. The type can be used in operator invocations. In this case, it is used in
"MySchema, tuple<int32 id>", which extends MySchema by adding another attribute
"int32 id". Tuples on stream SenSource have the type tuple<rstring s,
list<int32> ls, int32 id>, in other words, all attributes of MySchema, plus the
additional id attribute.
The syntax of a type definition is:
  standAloneTypeDef ::= 'type' ID '=' ( type | tupleBody ) ';'
  compositeTypeDef  ::= 'static'? ID '=' ( type | tupleBody ) ';'
Type definitions in composite operators can be static or non-static. Only static type
definitions (with the static modifier) can be used from outside the defining
operator with the Op.Type notation. Only non-static type definitions (without the
static modifier) can depend on instance-specific entities such as ports, streams,
parameters, or other non-static types.
(For SPADE users: Here is what the same example looked like in SPADE:
  [Typedefs]
  typedef Integers IntegerList
  [Program]
  vstream MySchema(s:String, ls:Integers)
  stream SenSource(schemaFor(MySchema), id:Integer)
    := Source() ["file:///SenSource.dat", nodelays, csvformat] {}
SPADE supported two different kinds of type definitions. Tuple types were defined
in the [Program] clause with the vstream keyword, whereas all other types were
defined in the [Typedefs] clause with the typedef keyword. SPL unifies the two.)
It is common to define a stream whose tuples have the same type as those of
another stream. To minimize syntactic overhead for this, SPL allows using the
name of a stream to refer to its type. For example:
  composite Main {
    type T = int32 i;
    graph stream<int32 i> S = Beacon() {}
          stream<T> B = Beacon() {}   // using T as type
          stream<S> C = Beacon() {}   // using S as type
  }
(For SPADE users: In SPADE, you had to use the schemaFor function to get the
type of a stream, whereas in SPL, you can simply use the name of a stream to
stand for its type.)
Value semantics
SPL treats streaming data as pure copies rather than having an identity in an
address space, because that is most efficient and natural. Disallowing references
also prevents null pointer errors, simplifies memory management, and prevents
unintended side effects of mutating a value stored in a collection or used as a map
key. For these reasons, all primitive and composite SPL types have value semantics,
not reference semantics. This means that:
• An assignment (=) copies the contents, instead of aliasing the location.
• An equality comparison (==) compares the contents, instead of comparing the
  locations.
One consequence of value semantics is that modifying the original after an
assignment does not change any copies. Consider this example:
  void test() {                                                   // 1
    mutable map<rstring, tuple<int32 x, int32 y>> places = { };   // 2
    mutable tuple<int32 x, int32 y> here = { x=1, y=2 };          // 3
    places["Hawthorne"] = here;                                   // 4
    here.y = 3;                                                   // 5
  }                                                               // 6
Line 3 initializes variable here with the value {x=1, y=2}. Line 4 assigns a copy of
that value into the map at key "Hawthorne". Line 5 modifies the version of the
value in variable here, so variable here now contains {x=1, y=3}. However, this
does not affect the copy at places["Hawthorne"], which is still {x=1, y=2}. Another
consequence of the value semantics is that since there is no notion of a reference,
there is no notion of null, nor can there be any cyclic data structures in SPL.
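Illustration (not SPL): for readers used to reference semantics, the contrast can be seen in Python, where plain assignment aliases an object. The sketch below emulates the SPL behavior of the Hawthorne example by copying explicitly with copy.deepcopy; the use of Python dictionaries to stand in for SPL tuples and maps is an assumption of the sketch.

```python
import copy

places = {}
here = {"x": 1, "y": 2}

# SPL assignment copies the contents. In Python that needs an explicit copy,
# because a plain assignment would store a reference to the same dict object.
places["Hawthorne"] = copy.deepcopy(here)

# Modifying the original afterwards does not affect the stored copy.
here["y"] = 3
```

After this runs, here holds x=1, y=3 while places["Hawthorne"] still holds x=1, y=2. Without the explicit copy, both names would refer to the same mutated object in Python, which is exactly the aliasing that SPL rules out.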
Practical advice: If you need a more powerful type system than that provided by
SPL, use SPL's extension mechanisms and implement your functionality in a
low-level language such as C++ or Java.
Type conversions
SPL is statically typed and strongly typed and has structural equivalence. SPL
adopts a strong static type system, because it saves time for runtime checks, and
errors that are not prevented statically are difficult to track down in a distributed
system. The static typing is enforced by explicit type declarations and compile-time
type checking. The strong typing is enforced by providing only a few implicit
conversions between types. Structural equivalence means that types of the same
structural composition are considered identical. SPL uses structural equivalence,
because that facilitates converting data to and from external applications, files, or
databases that do not share the same type system. While structural equivalence can
lead to types being considered equivalent even when they were intended to differ,
we chose structural equivalence anyway to avoid adapter bloat, which can
slow down programs and clutter up code. Instead of implicit conversions, SPL
offers explicit conversions (casts) where it makes sense. Explicit conversions look
like those in Java or C; for example, (int32)2.5 returns 2.
SPL allows the following explicit type conversions:
• From any type to the same type (identity conversion).
• From any type to any string type and back (string conversion). These
  conversions behave like they first serialize to UTF-8, then convert to the target
  type.
• From a blob to a bounded or unbounded list<uint8> and back.
• From any enumeration type to any integral type and back. In conversions,
  enumerations are numbered from zero. For example, given type t =
  enum{a,b,c};, then (int32)a == 0, and (t)1 == b.
• From any non-complex number type to any other non-complex number type.
  Widening a signed integral type to a larger signed integral type performs
  sign-extension on the 2s-complement representation. Other integral widenings
  zero-extend the representation. All integral narrowings perform truncation,
  discarding higher-order bits. Conversions from integers to floating-point
  numbers round to the nearest value, and conversions from floating-point
  numbers to integers round towards zero. Despite the fact that numeric
  conversions may lose information due to rounding, overflow, and so on, they
  never cause a runtime exception.
• From timestamp to float64 and back, where the float64 represents seconds.
• From a list of two non-complex numbers to a complex number.
• From any complex number type to any other complex number type.
• From any list type to any other list type, bounded or unbounded, if the element
  types are either identical, or the element types are convertible and primitive.
• From any set type to any other set type, bounded or unbounded, if the element
  types are either identical or convertible and primitive.
• From any map type to any other map type, bounded or unbounded, if the key
  and value types are either identical or convertible and primitive.
• From a tuple type WideT to a tuple type NarrowT if WideT has a superset of the
  attributes of NarrowT, in the same order. The conversion discards excess
  attributes and cannot be reverted. For more flexible tuple conversions, see the
  assignFrom library function.
• From the empty tuple literal {} to any tuple type.
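Illustration (not SPL): the rules for conversions to integral types in the list above can be approximated with plain integer arithmetic. The following Python sketch is an assumption-laden model, not the SPL implementation: the helper name to_int is hypothetical, Python's unbounded integers stand in for fixed-width values, and NaN or infinite inputs are not modeled. The wrap-around it applies to out-of-range floating-point inputs is also an assumption of the sketch.

```python
import math


def to_int(value, bits, signed):
    """Model a cast to a fixed-width integral type (hypothetical helper)."""
    if isinstance(value, float):
        value = math.trunc(value)        # floating point to integer rounds towards zero
    value &= (1 << bits) - 1             # narrowing keeps only the low-order bits
    if signed and value >= 1 << (bits - 1):
        value -= 1 << bits               # reinterpret the pattern as 2s-complement
    return value


truncated = to_int(2.5, 32, signed=True)      # like (int32)2.5
narrowed = to_int(300, 8, signed=False)       # like (uint8)300: high-order bits discarded
widened = to_int(-1, 16, signed=True)         # like (int16) of an int8 holding -1
reinterpreted = to_int(-1, 8, signed=False)   # like (uint8) of an int8 holding -1
```

None of these calls raises an exception, which matches the statement that numeric conversions may lose information but never fail at runtime.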
Note that SPL does not permit conversions from and to boolean. For example,
(boolean)Myint is illegal, because it is shorter and less ambiguous to use integer
comparison such as Myint != 0.
When SPL converts to bounded strings or lists, it drops excess elements at indices
exceeding the bound. When SPL converts to bounded sets or maps, and the size
exceeds the bound, it drops elements too, but the specific dropped elements are
implementation-dependent.
Types in SPL are equivalent if they are the same primitive types, or if they are
composed from equivalent types using the same type constructor (and in case of
tuples, the same attribute names in the same order). For example, variables of
types LocT1 and LocT2 in the following code can be assigned to each other:
  void test() {
    type LocT1 = int32 x, int32 y;
    type LocT2 = int32 x, int32 y;   // same attributes in same order
    LocT1 loc1 = { x=1, y=2 };
    mutable LocT2 loc2;
    loc2 = loc1;                     // this is legal
  }
As far as SPL is concerned, LocT1 and LocT2 are the same type; the fact that they
have different names is irrelevant. Therefore, the assignment between variables
loc1 and loc2 is legal, and does not constitute a type conversion. (For language
experts: The alternative to structural equivalence is name equivalence. For
example, in Java, two classes with different names are not equivalent even if they
have the same attribute names, types, and order. Since SPL types describe pure
data without behavior, and types are often anonymous and declared inline,
structural equivalence made more sense for SPL.)
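Illustration (not SPL): the difference between structural and name equivalence can be sketched in Python. The representation below is an assumption made for illustration: a tuple type is written as a sequence of (attribute type, attribute name) pairs, and the class names PointA and PointB are hypothetical.

```python
# Structural view: a type is described only by its composition.
LocT1 = (("int32", "x"), ("int32", "y"))
LocT2 = (("int32", "x"), ("int32", "y"))     # same attributes in same order
Swapped = (("int32", "y"), ("int32", "x"))   # same attributes, different order

structurally_equal = LocT1 == LocT2          # equal, the names LocT1/LocT2 do not matter
order_matters = LocT1 != Swapped             # attribute order is part of the type


# Name view: two classes with identical fields are still different types.
class PointA:
    x: int
    y: int


class PointB:
    x: int
    y: int


nominally_different = PointA is not PointB
```

The first half corresponds to how SPL compares LocT1 and LocT2; the second half corresponds to the name equivalence that Java uses for classes.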
SPL permits implicit conversions only in two places: subscripts and variable
initializers. For an example of an implicit type conversion in a subscript, let v be a
list. List indices are always uint32, so the subscript v[9] implicitly converts from
int32 to uint32, as if it had been written v[(uint32)9]. An out-of-bounds subscript
causes a runtime error, independently of whether or not there was an implicit
conversion involved. Not all subscripts are convertible. For example, subscripting a
list v with a boolean v[true] is a compile-time error.
An example of an implicit type conversion in a variable initializer is int8 x = 3;.
This code implicitly converts from int32 to int8, as if it had been written int8 x =
(int8)3;. (For language experts: Implicit conversion in initializers prevents what is
known as type stuttering, which is unnecessary repetition of the type in the
initializer.)
Unlike some other languages, in other contexts, SPL has no implicit conversion
from int to string (no "num"+1), from int to float (no 1+2.0), or from int to
boolean (no while(1) ...). To perform these conversions, you must make them
explicit: (rstring)1, or (float64)1, or true.
For SPADE users: Both SPADE and SPL are statically typed and strongly typed,
with few implicit conversions. The explicit conversion syntax of SPADE is more
verbose than that of SPL.
Chapter 3. Expression language
Though SPL is a streaming language, there are many places where it uses
expressions found in traditional imperative or functional languages. SPL
expressions look similar to those in languages like C, Java, or Python: for instance,
they have the same operators and even the same operator precedence. But since
run-time faults are a big challenge in distributed systems, SPL performs stricter
compile-time error checks on expressions than C, Java, or Python: it permits fewer
implicit type conversions, and restricts aliasing, side effects, and non-determinism
more. SPL's static guarantees also facilitate more compiler optimization. Some
expressions are evaluated at runtime: in operator output assignments or operator
parameters. Some expressions are evaluated purely at compile time, during code
generation: these include operator configurations and window clauses. And finally,
there are places where SPL accepts not only expressions, but even statements like
those in traditional languages: this is the case in operator logic and in function
definitions. Strictly speaking, SPL does not really need statements, because the user
can put such logic in native code. It supports statements anyway, because
statements are often easier to write directly in SPL than by going to a native
language. Also, statements written in pure SPL are more portable across back-end
languages, and offer more opportunities for front-end optimizations.
Expression operators
Expression operators are used to compute values, for example, with + or -. They
are not to be confused with stream operators, used to connect streams into a data
flow graph. The SPL operator table looks more or less like that of C, Java, or
Python:
Table 5. Expression operators used in SPL
Operator                 Arity  Description
(e)                      1      Parentheses for overriding operator precedence
e(...)                   N      Function call
e[...]                   N      Map lookup, or subscript or slice into string/blob/list
e.id                     1      Tuple attribute access
(t)e                     1      Type cast
++, --, !, -, ~          1      Increment, decrement, prefix logical/arithmetic/bitwise negation
*, /, %                  2      Multiplication, division, remainder
+, -                     2      Addition, subtraction
<<, >>                   2      Bitwise shift left, right
<, <=, >, >=, !=, ==     2      Comparison (value semantics)
&                        2      Bitwise and
^                        2      Bitwise xor
|                        2      Bitwise or
in                       2      Membership in list/map/set
&&                       2      Logical and (short-circuit)
||                       2      Logical or (short-circuit)
?:                       3      Ternary conditional
=, +=, -=, *= ...        2      Assignment (value semantics)
In this table, precedence goes from high (at the top) to low (at the bottom).
Literals, such as strings, maps, tuples, lists, and so on, have higher precedence than
the highest-precedence operator. The middle column of the table indicates arity (the
number of operands). All the binary operators (arity 2) are left-associative.
An expression of the form (t)e is treated as a cast if the left-hand side t is a literal
type or a simple or qualified name. Examples of casts include (list<int32>)[],
(complex32)[2,3], (x)-y, and (a.b::c.d)e.f. Examples of expressions that are not
casts even though they look similar include (f())e and (x .+ y)e.
Besides the simple assignment operator (=), the following operators first perform a
regular binary operation, and then an assignment: +=, -=, *=, /=, %=, <<=, >>=, &=,
^=, and |=.
As mentioned in the topic “Value semantics” on page 14, SPL uniformly uses value
semantics. Hence, even in the case of composite values, assignments always copy
the contents, and comparisons always compare the contents, never the location, of
a value.
The right-shift operator (>>) behaves differently for signed and unsigned integers.
For signed integers, it fills in with the sign bit, whereas for unsigned integers, it
fills in with zero. Use casts if you want to override this behavior.
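Illustration (not SPL): the two right-shift behaviors can be modeled in Python. This is a sketch under assumptions: the helper name shift_right is hypothetical, and Python's unbounded integers are masked to a fixed width to stand in for SPL's sized integer types.

```python
def shift_right(value, count, bits, signed):
    """Model >> on a fixed-width integer (hypothetical helper)."""
    if signed:
        # Python shifts negative integers arithmetically: the sign bit fills in.
        return value >> count
    # Unsigned: take the raw bit pattern, then fill in with zero from the left.
    return (value & ((1 << bits) - 1)) >> count


signed_result = shift_right(-8, 1, bits=8, signed=True)       # sign bit fills in
unsigned_result = shift_right(0xF8, 1, bits=8, signed=False)  # zero fills in
# Casting the signed value to the unsigned type first overrides the behavior.
cast_then_shift = shift_right(-8, 1, bits=8, signed=False)
```

The value -8 and the unsigned value 0xF8 share the same 8-bit pattern, so the first two results show how the same bits shift differently depending on signedness.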
The equality comparison operators == and != work on all SPL types, including
complex and containers. The ordered comparison operators <, >, <=, and >= work
on all ordered types. That includes most numeric types (all except complex
numbers), enumerations, blobs, timestamps (where the ordering ignores the
machine identifier), and strings (where the ordering is lexicographical by logical
character). Using <, >, <=, or >= on complex numbers, containers, tuples, or
booleans is a compile-time error.
Subscripts and slicing
String, blob, and list subscripts can either retrieve a single element, or can perform
slicing. Map subscripts can only refer to one element, not a slice, since maps are
unordered. The syntax and semantics for subscripting with an index or a slice
match the rules in Python. All string, blob, and list indexing is zero-based, and
slices include their lower bound but exclude their upper bound. For example, if a
is a list, then a[2:5] is the same as the list [a[2], a[3], a[4]]. If the lower bound
is omitted, the slice starts at element zero; if the upper bound is omitted, the slice
continues until the last element. For example, if the last element of a has index 9,
then a[7:] is the same as the list [a[7], a[8], a[9]]. Here is the syntax:
  subscriptExpr ::= expr '[' subscript ']'
  subscript     ::= expr | ( expr? ':' expr? )
String subscripts and slices are character-based, and result in strings. For example,
if s is a ustring, then s[3] retrieves the Unicode character at index 3 (the fourth
character, since indexing starts at 0), which is also of type ustring. For rstring
values, subscripts and slices are also character-based, but all characters are 8-bit,
so index computations are always constant-time.
Invalid subscripts cause runtime errors. The topic “Runtime errors” on page 20
explains what happens upon errors. A subscript is invalid if it is out-of-bounds for
its collection, unless it is the target of an assignment to a new key in a map. For
example, if list v has 3 elements, then only indices 0 <= i <= 2 are valid, and v[i]
is a runtime error for all i >= 3. On the other hand, if map m has no key "a", then the
assignment m["a"] = 1 performs auto-vivification (inserts a new key) for "a" and
maps it to the value 1. SPL supports auto-vivification for consistency with Python
and with the C++ standard libraries, but restricts it to cases where the assignment
is a pure write, without an earlier read step. For example:
  mapOfInt["newKey"] = 3;             // auto-vivify "newKey"
  tupleOfMap.m["newKey"] = 5;         // auto-vivify "newKey"
  mapOfMap["oldKey"]["newKey"] = 5;   // auto-vivify "newKey"
  mapOfMap["newKey"]["oldKey"] = 5;   // error: must read from "newKey" first
  mapOfInt["newKey"] += 2;            // error: must read from "newKey" first
  mapOfTuple["newKey"].b = 5;         // error: must read from "newKey" first
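Illustration (not SPL): Python dictionaries follow the same pure-write rule, so the distinction can be tried out directly. The sketch below is Python, not SPL; the variable names mirror the example above, and the key names "missing" and "other" are made up for illustration.

```python
map_of_int = {}
map_of_map = {"oldKey": {}}

map_of_int["newKey"] = 3              # pure write: inserts the new key
map_of_map["oldKey"]["newKey"] = 5    # reads the existing "oldKey", then a pure write

failures = []
try:
    map_of_map["missing"]["oldKey"] = 5   # must read "missing" first
except KeyError:
    failures.append("nested write")
try:
    map_of_int["other"] += 2              # += reads the old value before writing
except KeyError:
    failures.append("compound assignment")
```

Both failing statements need to read a key that does not exist yet, which is why they are errors rather than insertions.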
A slice x[lower:upper] is valid even if lower or upper is out of bounds. Here are
some examples:
  void test() {
    list<rstring> x = ["a","b","c"];
    rstring s = x[4];        // runtime error: index out of bounds
    mutable list<rstring> y;
    y = x[1:5];              // ["b","c"]
    y = x[5:5];              // [ ]
    y = x[5:1];              // [ ]
    y = x[5:];               // [ ]
    y = x[0:2];              // ["a","b"]
    y = x[:2];               // ["a","b"]
    y = x[2:0];              // [ ]
  }
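Illustration (not SPL): because the slicing rules match Python, the same cases can be checked in Python. Note one difference that this sketch does not cover: Python also accepts negative indices, whereas SPL list indices are uint32.

```python
x = ["a", "b", "c"]

# Out-of-bounds slice bounds are tolerated, so none of these raises an error.
slices = [x[1:5], x[5:5], x[5:1], x[5:], x[0:2], x[:2], x[2:0]]

# A single out-of-bounds index, by contrast, is an error.
try:
    x[4]
    index_ok = True
except IndexError:
    index_ok = False
```

The entries of slices correspond, in order, to the seven slice assignments in the SPL example.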
Practical advice: If you are not certain whether a subscript is valid, guard it with a
defensive membership test, for example:
  rstring munchkinLand(map<rstring,rstring> places) {
    if ("Oz" in places)
      return places["Oz"];
    return "not found";
  }
For SPADE users: The slicing rules match those in SPADE, except that the upper
bound of a slice was inclusive in SPADE but is exclusive in SPL for consistency
with the STL algorithms in C++ and with Python.
Mapped operators
SPL takes inspiration from Matlab and supports auto-vectorization or mapping of
expression operators to work over lists and maps. There are two kinds of mapped
operators: non-dotted and dotted. Non-dotted binary operators such as *, +=, or -
are mapped if one operand is a scalar and the other a list or map. Here are some
examples:
  void test() {
    mutable list<int32> ls = [1,2,3];
    ls = 2 * ls;    // 2 * [1,2,3] == [2,4,6]
    ls += 2;        // [2,4,6] + 2 == [4,6,8]
    ls = ls - 1;    // [4,6,8] - 1 == [3,5,7]
    mutable map<rstring,int32> mp = {"x":1, "y":2};
    mp = 2 * mp;    // 2 * {"x":1,"y":2} == {"x":2,"y":4}
    mp += 2;        // {"x":2,"y":4} + 2 == {"x":4,"y":6}
    mp = mp - 1;    // {"x":4,"y":6} - 1 == {"x":3,"y":5}
  }
SPL also has dotted mapped operators such as .+ or .*. If both operands are
equal-length lists or maps with the same key set, the dotted operators work on
corresponding pairs of elements at a time. For example:
  [3,2] .* [5,4] == [3*5, 2*4] == [15,8]             // multiply two lists
  {"x":4,"y":6} .- {"x":3,"y":1} == {"x":1,"y":5}    // subtract two maps
If the operands are lists of different sizes or maps with different key sets, the
mapped operator triggers a runtime error. Dotted operators have the same
precedence as their non-dotted counterparts.
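Illustration (not SPL): the two kinds of mapping can be modeled with ordinary functions. The Python sketch below is an approximation under assumptions: the helper names map_scalar and map_dotted are hypothetical, Python lists and dicts stand in for SPL lists and maps, and a ValueError stands in for the SPL runtime error.

```python
import operator


def map_scalar(op, left, right):
    """Non-dotted mapping: one operand is a scalar, the other a list or dict."""
    if isinstance(left, list):
        return [op(item, right) for item in left]
    if isinstance(left, dict):
        return {key: op(value, right) for key, value in left.items()}
    if isinstance(right, list):
        return [op(left, item) for item in right]
    return {key: op(left, value) for key, value in right.items()}


def map_dotted(op, left, right):
    """Dotted mapping: equal-length lists or dicts with the same key set."""
    if isinstance(left, list):
        if len(left) != len(right):
            raise ValueError("list size mismatch")
        return [op(a, b) for a, b in zip(left, right)]
    if left.keys() != right.keys():
        raise ValueError("key set mismatch")
    return {key: op(left[key], right[key]) for key in left}


doubled = map_scalar(operator.mul, 2, [1, 2, 3])                          # like 2 * ls
products = map_dotted(operator.mul, [3, 2], [5, 4])                       # like .* on lists
differences = map_dotted(operator.sub, {"x": 4, "y": 6}, {"x": 3, "y": 1})  # like .- on maps
```

Like the SPL operators, these helpers unwrap only a single dimension; they make no attempt to recurse into nested lists.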
Table 6. List of mapped operators
Dotted operators               Description
.*  ./  .%                     Mapped multiplication, division, remainder
.+  .-                         Mapped addition, subtraction
.<<  .>>                       Mapped bitwise shift left, right
.<  .<=  .>  .>=  .!=  .==     Mapped comparison
.&                             Mapped and
.^                             Mapped xor
.|                             Mapped or
Mapped operators only unwrap a single dimension. For example, 2 *
[[1,2],[3,4]] is not supported, and neither is [[1,2],[3,4]] .* [[5,6],[7,8]],
because they both would have to unwrap multiple dimensions of lists before the
operators are applicable.
For SPADE users: SPADE implicitly mapped both expression operators and
functions over lists. SPL only implicitly maps operators, not functions, because the
user can achieve the same effect with an explicit function and loops, and because
doing it implicitly would cause confusion with side effects and with SPL's richer
type system.
Runtime errors
Even though SPL has been designed to perform most of its error checking at
compile time, runtime errors can still happen in some cases. This includes an
invalid subscript for a list, blob, string, or map; a size mismatch in the operands of
mapped expression operators; integer division by zero; or exceptions in libraries
(such as C++ or Java exceptions). It excludes floating point division by zero, which
has the usual behavior of resulting in infinity if the numerator is non-zero and
NaN if the numerator is zero. When any runtime error occurs, the entire partition
(execution container) enclosing the streaming operator invocation dies, writes the
error to a log file, and stops accepting new tuples.
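Illustration (not SPL): Python raises an exception for floating point division by zero, so the behavior described above has to be spelled out to be modeled there. The following sketch is an assumption-labeled model of the usual IEEE 754 results; the helper name float_divide is hypothetical.

```python
import math


def float_divide(numerator, denominator):
    """Model floating point division that never raises (hypothetical helper)."""
    if denominator != 0.0:
        return numerator / denominator
    if numerator == 0.0 or math.isnan(numerator):
        return math.nan                  # zero divided by zero yields NaN
    # Non-zero numerator: infinity, with the sign determined by both operands.
    sign = math.copysign(1.0, numerator) * math.copysign(1.0, denominator)
    return math.copysign(math.inf, sign)


positive = float_divide(1.0, 0.0)
negative = float_divide(-1.0, 0.0)
undefined = float_divide(0.0, 0.0)
```

Integer division by zero, by contrast, is a runtime error in SPL, much as 1 // 0 raises ZeroDivisionError in Python.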
We considered adding full-fledged exception handling to SPL, with
try/catch/finally statements. But we decided against that because if the user
anticipates the exception, it is easy to check with assertions, whereas if the
exception is unanticipated, logging and a clean fault are easier to use and more
helpful than exception handlers a user could write. Adding exceptions might be
considered in a future release.
To help the user with error handling, SPL provides APIs for assertions and logging.
Assertions look like calls to an assert function with the following signatures:
  void assert (boolean condition)
  <string T> void assert (boolean condition, T message)
An assertion failure prints the message together with the line number to the log
file, and kills the enclosing partition. Unlike regular function calls, and like
assertions in C or Java, SPL assertions can be disabled. When an SPL program is
compiled with assertions disabled, their parameter expressions are not evaluated,
saving time and possibly runtime errors in the parameter expressions.
Besides assertions, SPL also provides two other features to help testing and error
handling. One is a log function, with the following signatures:
  // type spl::Sys.LogLevel = enum { error, info, debug, trace };
  <string T> void log(Sys.LogLevel logLevel, T message)
  <string T> void log(Sys.LogLevel logLevel, T message, T aspect)
The value error means that the log is unmaskable, whereas the values info, debug,
and trace mean that logging should be performed when the user requested least
verbose, more verbose, or most verbose logging, respectively.
Besides assertions and logging, SPL also provides an abort() function, which
unconditionally terminates the partition, even when compiled with assertions
disabled.
Statements
Statements are the traditional building blocks of imperative languages. As a
streaming language, SPL does not need statements most of the time, but they are
permitted inside operator logic and function definitions as a convenience. SPL's
assortment of statements is deliberately simple; this makes it easy to optimize and
easy to map to target languages, leading to better performance and portability.
A variable definition consists of an optional mutable modifier and a type, followed
by a comma-separated list of variable identifiers with optional initializer
expressions, followed by a semicolon:
  varDef ::= 'mutable'? type ( ID ( '=' expr )? )+, ';'
An example variable definition is mutable int32 i=2, j=3;. An immutable
variable is deep-immutable: even if it has a composite value, all parts of that value
are immutable too, recursively. The variable definition syntax is similar to C or
Java, except that the type does not get tangled up with the variable. For example,
SPL does not have variable definitions like int32 x, y[], z;. Immutable local
variables (without mutable modifier) must be initialized upon declaration. All
variables must be initialized before use.
A block consists of zero or more statements or type definitions, surrounded by
curly braces:
  blockStmt ::= '{' ( stmt | standAloneTypeDef )* '}'
An example block is { type T=int32; T i=0; foo(i,2); }. Local types and variables
defined in a block are in scope for the entire block. Blocks are often used as bodies
for control statements like if/while/for, or as function bodies.
An expression statement consists of an expression followed by a semicolon:
  exprStmt ::= expr ';'
An example expression statement is foo(i,2);. The purpose of an
expression statement is its side-effect. The only operators that have non-error
side-effects are assignments, certain function calls, and increment/decrement
(++/--) operators.
An if statement can have an optional else clause:
  ifStmt ::= 'if' '(' expr ')' stmt ( 'else' stmt )?
Dangling else is resolved to the innermost if; you can override this with blocks.
SPL does not have a C-style switch statement.
A for statement loops over the elements of a string, list, set, or map:
  forStmt ::= 'for' '(' type ID 'in' expr ')' stmt
SPL's for loops are similar to for (type ID : expr) loops in Java 5, but use the
Python-style in instead of the colon (:). SPL does not have a C-style 3-part for
loop, and the iterated-over collection becomes immutable for the duration of the
loop. This means that SPL for loops are countable loops, which are less
error-prone and permit more optimization opportunities than traditional loops. A
for loop over a string or list iterates over the elements or characters in index order.
A for loop over a set or map has an implementation-specific iteration order. In a
for loop over a map, the loop variable ID iterates over the keys; a common idiom
is to retrieve the associated value from the collection in the loop body.
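Illustration (not SPL): the key-iteration idiom looks much the same in Python, where a for loop over a dict also ranges over its keys. The map contents below are made-up sample data. One difference to keep in mind is that Python dicts iterate in insertion order, whereas the SPL iteration order over a set or map is implementation-specific, so the sketch sorts the result before relying on its order.

```python
places = {"Oz": "Emerald City", "Kansas": "farm"}   # made-up sample data

visited = []
for name in places:                        # the loop variable ranges over the keys
    visited.append((name, places[name]))   # look up the associated value in the body

# Sort before comparing, so that nothing depends on the iteration order.
ordered = sorted(visited)
```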
A while statement looks just like in C or Java:
  whileStmt ::= 'while' '(' expr ')' stmt
A break statement abruptly exits a while or for loop:
  breakStmt ::= 'break' ';'
A continue statement abruptly jumps to the next iteration of a while or for loop:
  continueStmt ::= 'continue' ';'
A return statement abruptly exits a function, optionally returning a value:
  returnStmt ::= 'return' expr? ';'
To summarize, SPL supports the following assortment of statements:
  stmt ::= varDef | blockStmt | exprStmt | ifStmt | forStmt
         | whileStmt | breakStmt | continueStmt | returnStmt
For SPADE users: SPADE supported a similar assortment of statements in custom
logic in Functor operators.
Functions
SPL functions are similar to C functions: they can take parameters, return a value
or return void, and are defined in a namespace, and not nested in anything else.
Functions are called from expressions, which can occur in many places in an SPL
program. There are two kinds of functions: non-native functions are written in SPL,
whereas native functions are written in C/C++. Both kinds of functions can be
invoked from SPL with the same syntax; the caller is oblivious to the
implementation language.
Functions themselves can be passed as parameters to operators, because an
operator invocation is fully resolved at compile time. But SPL does not permit
passing functions to other functions, or storing them in variables, because it makes
code harder to understand for the user, and it is harder to optimize indirect
invocations in a static compiler. For language experts: SPL provides neither
first-class functions nor higher-order functions.
Functions can be overloaded. In other words, there can be multiple functions with
the same name in the same scope, as long as they have different parameter
signatures. For example, the library has two function definitions for log, such that
log(4.2) computes a logarithm whereas log(Sys.info, "the answer") prints a
log-file message. While SPL permits overloading on parameter types, it forbids
overloading on return value types. For example, declaring both int32 f() and
rstring f() in the same scope is a compile-time error.
SPL functions
SPL functions are written directly in SPL, not in C/C++. Here is an example
non-native function definition:
  float64 toCelsius(float64 fahrenheit) {
    float64 freezing = 32f, ratio = 5f/9f;
    return (fahrenheit - freezing) * ratio;
  }
An SPL function definition consists of a head and a block:
  functionDef      ::= functionHead blockStmt
  functionHead     ::= functionModifier* type ID '(' functionFormal*, ')'
  functionModifier ::= 'public' | 'stateful'
  functionFormal   ::= 'mutable'? type ID
The function head contains optional modifiers, the return type, identifier, and list
of parameter definitions.
The modifiers stateful for functions and mutable for function parameters are
described from the caller's perspective in the topic “Side-effects” on page 25. From
the perspective of the function body, they restrict what it can do: the body of
function f1 can only call a stateful function f2 if f1 is declared stateful too, and the
body of a function can only modify a function parameter if that parameter is
declared mutable.
The modifier public for SPL functions makes the function visible from other
namespaces. Without the modifier, SPL functions are private and only visible in
their own namespace. Distinguishing public from private functions helps in large
projects, because internal helper functions do not become part of the public
interface, and because access to stateful private functions is restricted to code that
is aware of the invariants for the state.
Native functions
Native function prototypes are declared with SPL syntax in an XML model file, but
native function implementations are defined in a native file. Here, we only discuss
prototypes, leaving the discussion of the native implementation to the IBM Streams
Processing Language Toolkit Development Reference. An example native function
prototype is:
<ordered T> T max(list<T>)
This declares a generic max function that works on lists of any ordered type T. An
ordered type is a type for which the ordered comparison operators (<, >, <=,
and >=) are defined, including strings, timestamps, enumerations, blobs, and all
numeric types except complex numbers. Here is the syntax:
functionPrototype    ::= genericFormals functionModifier* type ID '(' protoFormal*, ')'
genericFormals       ::= ( '<' typeFormal+, '>' )? ( '[' boundsFormal+, ']' )?
typeFormal           ::= typeFormalConstraint ID
typeFormalConstraint ::= 'any' | 'collection' | 'complex' | 'composite' | 'decimal'
                       | 'enum' | 'float' | 'floatingpoint' | 'integral'
                       | 'list' | 'map' | 'numeric' | 'ordered' | 'primitive'
                       | 'set' | 'string' | 'tuple'
boundsFormal         ::= ID
protoFormal          ::= formalModifier* type ID?
Like SPL functions, native functions are stateless unless explicitly declared stateful,
and their parameters are immutable unless explicitly declared mutable.
Unfortunately, it is impractical to statically check statelessness or strict deep
parameter immutability for native functions, so library vendors must be careful to
declare native functions stateful or parameters mutable when they read or write
outside state. The modifier public for native functions makes the function visible
from other namespaces. Without the modifier, native functions are private and only
visible in their own namespace.
A generic type formal such as <ordered T> in the max function prototype can match
any actual parameter type subject to two restrictions: first, the actual parameter
type must obey the typeFormalConstraint; and second, the match must be the
same even if it occurs in multiple parameters. Consider the following native
function prototype:
<list T> T concat(T vals1,T vals2)
In this case, the first restriction requires that the parameters must be of some list
types, and the second restriction requires that both parameters are of the same list
type. For instance, the call concat([1,2], [3,4,5]) is correct, because both
parameters are of type T==list<int32>.
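
The two matching restrictions can be illustrated with a few calls. This is a sketch that assumes a native function with exactly the concat prototype shown above is visible in the current namespace; the wrapper function concatDemo is hypothetical.

```spl
// Prototype from the text (declared in a native function model, not here):
//   <list T> T concat(T vals1, T vals2)

void concatDemo() {
  // Accepted: both arguments are list<int32>, so T == list<int32>.
  list<int32> ints = concat([1, 2], [3, 4, 5]);

  // Rejected: T cannot be list<int32> and list<rstring> at the same time.
  // list<int32> mixed = concat([1, 2], ["a", "b"]);

  // Rejected: int32 is not a list type, so the 'list' constraint fails.
  // int32 sum = concat(1, 2);
}
```
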
Besides generic type formals, a native function prototype also has optional generic
bounds formals. For example, here is a prototype that overloads the max function
for bounded lists:
<ordered T>[N] T max(list<T>[N])
Again, if the bounds-formal appears multiple times in the prototype, then all the
matching bounds from actual parameter types must be identical. If there are
ambiguous overloaded versions of the same generic native function in the same
scope, the compiler flags an error.
For language experts: Native function call resolution, type checking, and overload
conflict detection use a standard unification algorithm, augmented with checks for
the constraints of type formals.
Practical advice: Use the following guidelines to decide whether to write your
functions in SPL or as native C/C++ code. Write an SPL function if you want to
avoid the burden of going to a different file or language, or if you want the
portability of auto-generated code, or you want the compiler to check statelessness
for you. Write a native function if you want to reuse existing native code, or if you
need generics, or if the native language permits a more natural implementation, or
if the native function is significantly faster than one written in SPL.
For SPADE users: All functions in SPADE were native.
Pass-by-reference
All parameters (mutable or immutable, primitive or composite) are passed
by-reference. In other words, the formal parameter in the callee is an alias of the
actual parameter in the caller. In some cases, for example, in the case of an
immutable scalar, pass-by-value yields the same behavior as pass-by-reference. In
such cases, the compiler may choose to optimize the code by internally
implementing pass-by-value, since there is no observable difference to the user.
Note that the pass-by-reference semantics of parameters stands in contrast to the
deep-copy semantics of assignments (for more information, see “Value semantics”
on page 14). Function parameters have by-reference semantics (like T &v in C++),
because copying large data into a function would be inefficient when operating
only on a small part of that data. The following example illustrates SPL's
parameter passing semantics:
void callee(mutable list<int32> x, mutable list<int32> y) {
  x[0] = 1;
  y = [3,4];
}
void caller() {
  mutable list<int32> a = [0,0], b = [0,0,0];
  callee(a, b);
}
The assignment x[0]=1 in the callee yields a[0]==1 in the caller. And the
assignment y=[3,4] in the callee yields b==[3,4] in the caller. If the caller passes a
computed value that does not have a storage location, then the callee stores it in a
temporary, but modifications have no side effect on the caller. If you prefer
by-value semantics for parameters, you can easily emulate them by copying the
by-reference parameters into local variables.
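
A minimal sketch of that emulation, using hypothetical function names: the callee copies its parameter into a local variable and modifies only the copy. It relies on the deep-copy semantics of assignment described in “Value semantics”.

```spl
void calleeByValue(mutable list<int32> x) {
  // Assignment makes a deep copy, so xCopy does not alias the caller's list.
  mutable list<int32> xCopy = x;
  xCopy[0] = 1;   // changes only the local copy
}

void callerByValue() {
  mutable list<int32> a = [0, 0];
  calleeByValue(a);   // a is unchanged, because only the copy was modified
}
```
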
Side-effects
Code like y = (x = 5) + x--; is hard to read and brittle, because it has multiple
side effects in a single statement, so its meaning depends on the statement-internal
evaluation order. The situation is even worse if the same expression calls multiple
functions with side effects. For example, the meaning of y = foo(x,5) + bar(x);
not only depends on evaluation order, but furthermore, depends on the definition
of foo and bar. For example, x may be a list, and foo and bar may be push and
pop, respectively. Statements with multiple side effects are not only hard to
understand for a human, but they are also hard to optimize for a compiler. In the
absence of side-effects, compilers often optimize by reordering or even
parallelizing independent code and eliminating redundant code. Fortunately, even
in imperative languages like C and Java, expression-internal side effects are
uncommon, and referential transparency is common. Unfortunately, without
language support, this is hard to establish in a compiler. Therefore, SPL is designed
to make side effects more explicit, and to encourage a coding style where
side-effects are less common.
In previous sections, we already saw a few features and design decisions that curb
side effects:
v Mutable composite data is never aliased. Since SPL has no pointer type (see
topic “Composite types” on page 11), and since assignments make a deep-copy
even in the case of composite types (see topic “Value semantics” on page 14),
there is no aliasing inside of composite data. That way, a side effect to one
composite variable does not silently corrupt another composite variable.
v Variables are immutable by default (see topic “Statements” on page 21). C++ and
Java allow you to explicitly declare variables immutable with const and final,
respectively, but even though most variables are immutable and could be
declared that way, programmers typically forget to make that explicit. SPL
inverts the default, making mutable an explicit modifier. Variables without that
modifier are deeply immutable. That way, side-effect freedom is more common
and easier to establish for humans and compilers alike.
v Collections in for-loops are immutable (see topic “Statements” on page 21).
While a for-loop iterates over a collection, that collection becomes immutable.
That prevents common mistakes where the loop body has an unintended side
effect on the loop control.
In addition, SPL has the following rules to curb side effects:
v Function parameters are immutable by default. In practice, functions that mutate
their parameters are infrequent. They are mostly used to make a small
modification to a large data structure. In SPL, mutable parameters must be
explicitly annotated with the mutable modifier, and all other parameters are
deep-immutable. Thanks to this information, the compiler can produce helpful
errors and perhaps even perform optimizations. For example:
void test(float64 x, list<float64> z) {
  for (float64 y in z) {
    print(x);
    print((x * 100.0)/y);
  }
}
Since function print does not modify x, a compiler could hoist the
loop-invariant expression x * 100.0 out of the loop:
void test(float64 x, list<float64> z) {
  float64 loopInvariantTmp = x * 100.0;
  for (float64 y in z) {
    print(x);
    print(loopInvariantTmp/y);
  }
}
Besides enabling optimizations, making function parameters immutable by
default also makes code easier to read and maintain.
v Mutable function parameters are never aliased. One potential loophole in the
aliasing prevention described so far could occur when the same data is passed to
multiple function parameters. Consider for example a function copy(count,
srcList, srcIdx, mutable dstList, dstIdx) that copies count elements of
srcList starting at srcIdx to dstList starting at dstIdx. If the two lists are the
same, then the copy may overwrite some of the elements that it reads. For
example, a call like copy(length(x) - 1, x, 0, x, 1) would be brittle, because
both srcList and dstList are aliased to x, and because dstList is mutable.
Therefore, SPL disallows any mutable parameter to be aliased with any other
parameter in the same function call.
v Functions are stateless by default. A stateful function is a function that is not
referentially transparent or has side-effects. A function is not referentially
transparent if it does not consistently yield the same result each time it is called
with the same inputs. A function has side-effects if it modifies state observable
outside the function. For the purposes of this definition, “state observable
outside the function” includes global variables in native code, and I/O to the
console, files, the network, and so on, but excludes mutable parameters. Mutable
parameters are handled separately because, as the loop invariant code motion
example above shows, they have separate optimization opportunities (print is
stateful but its parameter can be hoisted). Here is an example that illustrates
how code that uses stateless functions is easier to understand and optimize:
int32 ackermann(int32 m, int32 n) { /* do something expensive */ return 0; }
int32 test(int32 m, int32 n) {
  int32 x = ackermann(m, n);
  int32 y = ackermann(m, n);
  return x + y;
}
If the ackermann function is stateless and has immutable parameters, then a
compiler could eliminate one of the calls:
int32 ackermann(int32 m, int32 n) { /* do something expensive */ return 0; }
int32 test(int32 m, int32 n) {
  int32 x = ackermann(m, n);
  int32 y = x;
  return x + y;
}
To make statelessness easy to determine, all functions in SPL are stateless
unless they are explicitly annotated with the stateful modifier. (For language
experts: Functions that are stateless and have no mutable parameters are
pure. Immutable parameters curb context-specific side effects, whereas
statelessness curbs context-independent side effects.) When designing SPL, we
considered outlawing stateful functions altogether. However,
some stateful functions are useful, for example, print, or functions that
interact with external resources such as databases. Furthermore, stateful
functions can yield better performance through memoization. Therefore, we
decided to permit them in SPL, but the language design encourages mostly
writing stateless functions.
v State written by a statement must not be used elsewhere in the same statement.
This comes back to the examples from the beginning of this section. This rule
disallows code like y = (x = 5) + x--;, since x is written in one part and used
in another part of the statement. The various rules related to functions above
also enable the SPL compiler to check this rule for statements involving function
calls. For example, y = foo(x,5) + bar(x); is not allowed if either foo or bar
has a mutable parameter. This restriction makes code more readable, prevents
common programming mistakes, and may lead to more optimization
opportunities.
Together, these rules mean that for most statements, the compiler is free to
implement any internal expression evaluation order, and the user cannot observe
the difference. The only exception is expressions involving floating point numbers,
which the compiler must always implement such that they evaluate left-to-right.
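
The last two rules can be sketched as follows. The helper copyFirst and the function sideEffectDemo are hypothetical; the rewritten statements show only one possible reading of the rejected statement, assuming left-to-right evaluation was intended.

```spl
// Hypothetical helper with one immutable and one mutable list parameter.
void copyFirst(list<int32> src, mutable list<int32> dst) {
  dst[0] = src[0];
}

void sideEffectDemo() {
  mutable int32 x = 0;
  mutable int32 y = 0;

  // Rejected: x is written by (x = 5) and also used by x-- in one statement.
  // y = (x = 5) + x--;

  // One possible intent, spelled out with one side effect per statement.
  x = 5;
  y = x + x;
  x--;

  mutable list<int32> data = [1, 2, 3];

  // Rejected: the mutable parameter dst would alias the parameter src.
  // copyFirst(data, data);

  // Accepted: assignment deep-copies, so snapshot and data do not alias.
  list<int32> snapshot = data;
  copyFirst(snapshot, data);
}
```
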
Chapter 4. Operator invocations
The purpose of SPL is to allow users to create streams of data, which are
connected and manipulated by operator invocations. SPL programs are designed to
be deployed on distributed hardware for scaling [1 on page 69]. The main goals of
SPL are scalability (exploiting distributed hardware), extensibility (encapsulating
low-level code in high-level operators), and usability (easing the writing of scalable
and extensible code).
A stream is an infinite sequence of tuples. As we saw in Chapter 2,
“Types,” on page 7, a tuple is simply an instance of a tuple type, consisting of
named attributes. Each stream is the result of an operator invocation. An operator
invocation observes data from zero or more input streams and defines zero or
more output streams. Every stream is defined by exactly one operator invocation,
but can be consumed by any number of operator invocations. Each time a tuple
arrives on any one of the input streams of an operator invocation, the operator
fires, and can produce tuples on its output streams [6 on page 69]. An operator
invocation in SPL returns streams, analogously to how a function invocation in a
traditional language returns values. However, given that a stream is an infinite
sequence of tuples, each operator invocation is active for the duration of the
program execution, firing repeatedly, once for each of its input tuples. This section
shows how to define streams of tuples by invoking operators. The syntax for
defining a stream, for example:
stream<int32 i> Strm1 = SomeOperator(Strm2)
is designed to resemble the syntax for defining a variable, for example:
list<int32> list1 = someFunction(list2)
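
To make the "defined once, consumed many times" property concrete, here is a minimal sketch. It assumes the standard toolkit operators Beacon and Filter with their iterations and filter parameters; the composite name and stream names are hypothetical, and the tuple contents are beside the point.

```spl
composite FanOutSketch {
  graph
    // Src is defined by exactly one operator invocation ...
    stream<int32 i> Src = Beacon() {
      param iterations: 10u;
    }
    // ... but is consumed by two separate operator invocations.
    stream<int32 i> Evens = Filter(Src) {
      param filter: i % 2 == 0;
    }
    stream<int32 i> Odds = Filter(Src) {
      param filter: i % 2 != 0;
    }
}
```

Each Filter invocation fires once for every tuple that arrives on Src, independently of the other.
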
Here is a typical example of an operator invocation, repeated from the
introduction:
stream<rstring buyer, rstring seller, rstring item> Sale = Join(Bid; Ask) {  //1
  window Bid: sliding, time(30);                                            //2
         Ask: sliding, count(50);                                           //3
  param  match: Bid.item == Ask.item && Bid.price >= Ask.price;             //4
  output Sale: item = Bid.item;                                             //5
}
Each operator invocation has a head and a body. The head (Line 1 in the example)
lists the connected streams and the operator used to process these streams. The
body (Lines 2-6 in the example) elaborates on how the operator is to be invoked.
The BNF syntax is:
opInvoke::= opInvokeHead opInvokeBody
SPL supports a wide variety of operators with the default toolkits shipped with the
compiler, and furthermore, developers can extend SPL with new operators. Each
operator can be customized and reused in different places in a data flow graph. To
support this customization, SPL supports a versatile customization syntax. All the
configuration happens in the operator invocation body, to avoid tangling it with
the data flow specification in the operator invocation head. The operator body can
have five clauses, though some are usually omitted. SPL separates the body of
operator invocations into clauses to make them easy to read. Below, we extend the
previous example to illustrate all five clauses:
composite SaleJoin {                                                    //1
  graph                                                                 //2
    stream<rstring buyer, rstring item, decimal64 price>                //3
      Bid = FileSource() { param file: "bids.txt"; }                    //4
    stream<rstring seller, rstring item, decimal64 price>               //5
      Ask = FileSource() { param file: "asks.txt"; }                    //6
    stream<rstring buyer, rstring seller, rstring item, uint64 id>      //7
      Sale = Join(Bid; Ask)                                             //8
      {                                                                 //9
        logic  state: mutable uint64 n = 0;                             //10
               onTuple Bid: n++;                                        //11
        window Bid: sliding, time(30);                                  //12
               Ask: sliding, count(50);                                 //13
        param  match: Bid.item == Ask.item && Bid.price >= Ask.price;   //14
        output Sale: item = Bid.item, id = n;                           //15
        config wrapper: gdb;                                            //16
      }                                                                 //17
}                                                                       //18
The five operator invocation body clauses are:
v The logic clause consists of local state that persists over the whole program
execution, and statements that execute when a tuple arrives on an input port
(see topic “Logic clause” on page 33).
v The window clause specifies how many previously received tuples of each port
to remember for later processing by stateful operators such as Join, Sort, or
Aggregate (see topic “Window clause” on page 35).
v The param clause contains code snippets, such as expressions, supplied to the
operator at invocation time; at runtime, the operator executes them whenever
needed for its behavior (see topic “Param clause” on page 38).
Figure 3. Clause activation sequence example for Join operator
v The output clause assigns values to attributes in output tuples each time the
operator submits data to an output stream (see topic “Output clause” on page
38).
v The config clause gives directives and hints that influence how the compiler