Data Stream Processing: Understanding Punctuations and Their Role in Query Processing

These notes introduce punctuations in data stream processing and their impact on query processing. They cover why punctuations matter, where they come from, and how they help unblock group-by operators and reduce join state. They also cover the representation of punctuated streams in Haskell and the behavior of stream iterators.


CS 410/510

Data Streams

Lecture 6: Punctuation

David Maier, Kristin Tufte

with thanks to Pete Tucker

10/10/2007 Data Streams: Lecture 6

Overview

• Introduction and Initial Work
  ◦ Further Questions
• Stream Iterator Framework
• Theory of Punctuation Semantics
• Performance
• Benefiting Entire Query Plans

Example

• Online auction management system
• Three kinds of data streams: Persons, Auctions, Bids

Person(p_id, name, email, city, state)
Auction(a_id, expires, seller, category)
Bid(a_id, bidder, hour, minute, second, price)

[Figure: Online Auction System diagram showing the Person, Auction, and Bid streams and Categories]

Example Query

Suppose we wanted to report the closing price for each auction in a specific category.

SELECT A.a_id, MAX(price)
FROM Auction A, Bid B
WHERE A.a_id=B.a_id AND
A.category=
GROUP BY A.a_id

[Figure: query plan with selection (σ) over the Auction and Bid inputs]

Generalizing End of Input (EOI)

• EOI tells an operator that no more data will arrive from a given input
• Blocking operators can output results
• Stateful operators can purge state
• Idea: items in the stream denoting the end of data subsets might improve blocking and stateful operators

Punctuations

• A punctuation describes a subset of data in a stream
• A data item d is said to match a punctuation p if d belongs to the subset described by p
• A punctuation in a stream indicates that no more data items matching that punctuation will occur
• A punctuated stream is grammatical if, for each punctuation p, no following data item matches p

An auction is complete when its expiration time passes. The bid stream source can emit a punctuation when an auction expires, indicating that no more bids for that auction will arrive.

Example Query, part 3

• End-of-auction punctuations allow us to reduce the size of state for the join
• End-of-auction punctuations unblock the group-by

SELECT A.a_id, MAX(price)
FROM Auction A, Bid B
WHERE A.a_id=B.a_id AND
A.category = 10
GROUP BY A.a_id

[Figure: query plan with selection (σ) over the Auction and Bid inputs]

Consequences of Punctuations

• Blocking operators may produce some output before end of stream
  ◦ Group-by outputs results corresponding to groups that match punctuations
• Stateful operators may keep less state
  ◦ Join keeps only data items that do not match punctuation
• Any operator should output punctuation whenever possible
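The first two consequences can be illustrated with a small Haskell sketch of a group-by (MAX) over bids. This is only an illustration under stated assumptions, not the Niagara implementation: a punctuation is modeled as `Right a_id`, meaning "no more bids for this auction", and the names `Item`, `stepItem`, and `groupMax` are hypothetical.

```haskell
import qualified Data.Map.Strict as Map

type AuctionId = Int
type Price     = Int

-- Left = bid (data item); Right = assumed "auction closed" punctuation.
type Item = Either (AuctionId, Price) AuctionId

-- Running maximum price per auction that is still open.
type State = Map.Map AuctionId Price

-- A bid updates the running maximum. A punctuation emits the finished
-- group (unblocking) and removes it from state (purging).
stepItem :: State -> Item -> ([(AuctionId, Price)], State)
stepItem st (Left (a, p)) = ([], Map.insertWith max a p st)
stepItem st (Right a)     =
  case Map.lookup a st of
    Just p  -> ([(a, p)], Map.delete a st)
    Nothing -> ([], st)

-- Run over a finite stream; returns emitted results and leftover state.
groupMax :: [Item] -> ([(AuctionId, Price)], State)
groupMax = foldl go ([], Map.empty)
  where
    go (out, st) x = let (o, st') = stepItem st x in (out ++ o, st')
```

For the input `[Left (1,10), Left (2,5), Left (1,30), Right 1, Left (2,7)]`, the result for auction 1 is emitted as soon as its punctuation arrives, and only auction 2 remains in state.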

Sources of Punctuations

• Source or sensor intelligence
  ◦ Source clock: end of hour
  ◦ Max skew and latency
• Knowledge of access order
  ◦ Sorted data
• Knowledge of stream or application semantics
• Auxiliary information
• Stream operator behavior
  ◦ Windowed multi-join

Initial Testing

• Conducted an ad hoc test of our ideas
  ◦ Niagara 1.0 (Java version)
  ◦ Firehose
  ◦ Simple query using union (with duplicate elimination), group-by
• Promising results
  ◦ Group-by unblocked
  ◦ Union state size reduced (for data items)
  ◦ Minimal overhead

Test Implementation

• Query: output the maximum temperature each hour reported from many warehouse sensors:

SELECT hour, MAX(temp)
FROM (SELECT * FROM sensor1
      UNION
      SELECT * FROM sensor2
      UNION
      ...
      UNION
      SELECT * FROM sensorN)
GROUP BY hour;

• Modified a few Niagara operators to make use of punctuation
  ◦ Only those necessary to make our test query work

Initial Test Results

[Figure: Query Performance. Time (sec) vs. punctuations per hour (0, 1, 5, 10, 30); series: First Output, Last Output]

[Figure: State Size for Union Operator. Tuples in state vs. number of tuples arrived; series: No Punctuation, With Punctuation]

  • Test run over data spanning 60 hours
  • Group-by unblocked (note when first result occurred)
  • Memory required was reduced

Representation of Streams

• Stream represented as a "sliced list"
  ◦ Easier to model a finite stream
  ◦ Variability in input arrival rate vs. operator processing rate
  ◦ Interleavings of multiple inputs

S = [[1, 2], [3], [ ], [4, 5], [6], … ]

• Notation
  ◦ S[i]: first i slices of S
  ◦ S@i: i-th slice of S

S[2] = [1, 2, 3]    S[4] = [1, 2, 3, 4, 5]
S@2 = [3]           S@4 = [4, 5]
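The two notations can be written as small Haskell functions over a list of slices. This is a minimal sketch; the names `Sliced`, `prefix`, and `sliceAt` are hypothetical and not taken from the lecture.

```haskell
-- A sliced list: each inner list is one slice of the stream.
type Sliced a = [[a]]

-- S[i]: the data items in the first i slices, flattened.
prefix :: Int -> Sliced a -> [a]
prefix i = concat . take i

-- S@i: the i-th slice (1-based); empty if i is out of range.
sliceAt :: Int -> Sliced a -> [a]
sliceAt i s
  | i < 1     = []
  | otherwise = case drop (i - 1) s of
      (x : _) -> x
      []      -> []
```

With `s = [[1,2],[3],[],[4,5],[6]]`, `prefix 2 s` and `sliceAt 4 s` reproduce the values S[2] = [1, 2, 3] and S@4 = [4, 5] given above. Because both functions are lazy, they also work on an unbounded stream of slices.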

Stream Iterator

• Not all stream-to-stream functions are suitable: sort on positive numbers
• A stream iterator is a function that accesses input incrementally
  ◦ We want to avoid functions that must access the entire input
  ◦ f(S) = q(S@1, st_0) ++ q(S@2, st_1) ++ … ++ q(S@i, st_{i-1}) ++ …
    where st_j = r(S@j, st_{j-1})
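The decomposition of f into an output function q and a state-update function r can be sketched directly in Haskell. The names `runIterator`, `dedupQ`, `dedupR`, and `dedup` are hypothetical; duplicate elimination is used here only as a convenient instance.

```haskell
import Data.List (nub, union)

-- f(S) = q(S@1, st_0) ++ q(S@2, st_1) ++ ..., with st_j = r(S@j, st_{j-1}).
-- Each slice is consumed once, so output is produced incrementally.
runIterator :: ([a] -> st -> [b]) -> ([a] -> st -> st) -> st -> [[a]] -> [b]
runIterator _ _ _  []         = []
runIterator q r st (s : rest) = q s st ++ runIterator q r (r s st) rest

-- q for duplicate elimination: items in this slice not seen before.
dedupQ :: Eq a => [a] -> [a] -> [a]
dedupQ s st = filter (`notElem` st) (nub s)

-- r for duplicate elimination: remember every item seen so far.
dedupR :: Eq a => [a] -> [a] -> [a]
dedupR s st = st `union` s

dedup :: Eq a => [[a]] -> [a]
dedup = runIterator dedupQ dedupR []
```

By contrast, a sort cannot be written this way with useful output, because nothing can be emitted for a slice until all later slices are known.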

Punctuated Streams in Haskell

• Each tuple may be Either a data item or punctuation
  ◦ Constructors Left, Right
  ◦ New class Pattern
  ◦ New class Punc is a tuple of Patterns

[[Left 1, Left 5], [Left 3], [Right (Range (0,4))], [Left 5, Left 6, Left 7], … ]

type Stream a b = [[Either a b]]
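As a hedged illustration of this representation, the sketch below checks whether a finite punctuated stream is grammatical. It assumes a single hypothetical pattern type, `Range`, read as an inclusive integer range; the lecture's Pattern and Punc classes are more general, and the names `Punct`, `matches`, and `grammatical` are not from the slides.

```haskell
type Stream a b = [[Either a b]]

-- Assumed pattern type: an inclusive range, modeled on Range (0,4).
newtype Punct = Range (Int, Int) deriving (Eq, Show)

-- A data item matches a punctuation if it lies in the described subset.
matches :: Int -> Punct -> Bool
matches d (Range (lo, hi)) = lo <= d && d <= hi

-- A finite stream is grammatical if no data item matches any
-- punctuation that precedes it.
grammatical :: Stream Int Punct -> Bool
grammatical = go [] . concat
  where
    go _    []             = True
    go seen (Left d  : xs) = not (any (matches d) seen) && go seen xs
    go seen (Right p : xs) = go (p : seen) xs

-- The finite prefix of the example stream shown above.
example :: Stream Int Punct
example =
  [[Left 1, Left 5], [Left 3], [Right (Range (0,4))], [Left 5, Left 6, Left 7]]
```

Under the inclusive-range assumption, `example` is grammatical, because 5, 6, and 7 all fall outside 0..4. A stream such as `[[Right (Range (0,4))], [Left 3]]` is not.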

Representing Stream Iterators (No Punctuation Case)

• A stream iterator is a 3-tuple:
  ◦ initial_state: operator state before data items arrive
  ◦ step: function called when new slice arrives
  ◦ final: function called when input stream ends
• Encapsulated in a single data type

data Basic state input output =
  B ([input] -> state -> ([output],state))  -- step
    (state -> ([output],state))             -- final

--duplicate elimination
([], step, final), where
  step xs st = ((nub xs \\ st), union st xs)
  final st = ([], [])

Stream Iterators for Punctuated Streams

• Punctuated stream iterator has 5 parts:
  ◦ initial_state: State at beginning of execution
  ◦ step: New results and updated state
  ◦ pass: New result data due to punctuations
  ◦ prop: New output punctuations due to punctuations
  ◦ keep: Update state due to punctuations

data Basic state input inputp output outputp =
  B ([input]  -> state -> ([output],state))  --Step
    ([inputp] -> state -> [output])          --Pass
    ([inputp] -> state -> [outputp])         --Prop
    ([inputp] -> state -> state)             --Keep

Common Behavior Function for Punctuated Streams

• Common behavior of punctuated-stream iterators
  ◦ Read input slice, separate data items from punctuations
  ◦ Output appropriate result data items and punctuation
  ◦ Manage state
• Customize via calls to step, pass, prop, keep

unary :: s -> (Basic s it ip ot op) -> Stream it ip -> Stream ot op
unary st (basic@(B step pass prop keep)) (xs:rest) =
  [map norm tsOut ++ map norm tsExtra ++ map punct psOut]
  ++ (unary stNew' basic rest)
  where (ts,ps)       = splitPunc xs ([],[])
        (tsOut,stNew) = step ts st
        tsExtra       = pass ps stNew
        psOut         = prop ps stNew
        stNew'        = keep ps stNew

Example Stream Iterators

--select iterator
selectS :: (a -> Bool) -> Stream a b -> Stream a b
selectS pred = unary [] (B step passT prop keepT)
  where step ts st = (filter pred ts, [])
        prop ps st = ps

--dupelim iterator
dupelimS :: Stream a b -> Stream a b
dupelimS = unary [] (B step passT prop keep)
  where step ts st = ((nub ts \\ st), union st ts)
        prop ps st = ps
        keep ps st = setNomatchTs st ps   -- state reduced
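The fragments above leave several helpers undefined. The sketch below fills them in so the two iterators can be run. Everything it adds is an assumption rather than the lecture's code: `norm` and `punct` are taken to be `Left` and `Right`, `passT` is taken to emit nothing, `keepT` to leave state unchanged, punctuations are restricted to a hypothetical inclusive `Range` type, a base case for the end of a finite stream is added to `unary`, and `selectS` uses `()` as its (unused) state.

```haskell
import Data.List (nub, union, (\\))

type Stream a b = [[Either a b]]

-- Assumed pattern type: an inclusive range, modeled on Range (0,4).
newtype Punct = Range (Int, Int) deriving (Eq, Show)

matches :: Int -> Punct -> Bool
matches d (Range (lo, hi)) = lo <= d && d <= hi

data Basic state input inputp output outputp =
  B ([input]  -> state -> ([output], state))  -- step
    ([inputp] -> state -> [output])           -- pass
    ([inputp] -> state -> [outputp])          -- prop
    ([inputp] -> state -> state)              -- keep

-- Separate the data items from the punctuations in one slice.
splitPunc :: [Either a b] -> ([a], [b])
splitPunc xs = ([t | Left t <- xs], [p | Right p <- xs])

unary :: s -> Basic s it ip ot op -> Stream it ip -> Stream ot op
unary _  _ [] = []  -- added base case for finite streams
unary st basic@(B step pass prop keep) (xs : rest) =
  (map Left tsOut ++ map Left tsExtra ++ map Right psOut)
    : unary stNew' basic rest
  where
    (ts, ps)       = splitPunc xs
    (tsOut, stNew) = step ts st
    tsExtra        = pass ps stNew
    psOut          = prop ps stNew
    stNew'         = keep ps stNew

-- Assumed trivial behaviors: no extra output, state left as is.
passT :: [ip] -> s -> [ot]
passT _ _ = []

keepT :: [ip] -> s -> s
keepT _ st = st

-- Keep only the stored items that match none of the punctuations.
setNomatchTs :: [Int] -> [Punct] -> [Int]
setNomatchTs st ps = [t | t <- st, not (any (matches t) ps)]

selectS :: (a -> Bool) -> Stream a b -> Stream a b
selectS p = unary () (B step passT prop keepT)
  where step ts st = (filter p ts, st)
        prop ps _  = ps

dupelimS :: Stream Int Punct -> Stream Int Punct
dupelimS = unary [] (B step passT prop keep)
  where step ts st = (nub ts \\ st, union st ts)
        prop ps _  = ps
        keep ps st = setNomatchTs st ps
```

Running `dupelimS` on `[[Left 1, Left 5], [Left 3], [Right (Range (0,4))], [Left 5, Left 6, Left 7]]` drops the second `Left 5` as a duplicate and passes the punctuation through. After the punctuation arrives, the items 1 and 3 are purged from state, since no later item can match them.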

Question 2: “How do we know our iterators are behaving reasonably?”

• Have implementations of iterators on punctuated streams
• What does it mean to behave reasonably?
  ◦ Data items output
  ◦ Punctuations emitted

Punctuation Invariants

• Punctuation invariants define cumulative behavior for a stream iterator
  ◦ Based on the arrival of some prefix of input, what should be done
• Three kinds of invariants (based on input data and punctuation)
  ◦ Pass invariant: what results can be output
  ◦ Propagation invariant: what punctuations can be output
  ◦ Keep invariant: what state must be maintained

Pass Invariants

• Pass invariants take the form
  $\mathrm{cpass}(T_1, P_1, \ldots, T_n, P_n) = T_{out}$
  ◦ Note the 'c' to denote cumulative behavior
• Examples
  ◦ select: $\mathrm{cpass}(T_1, P_1) = \sigma(T_1)$
  ◦ difference: $\mathrm{cpass}(T_1, P_1, T_2, P_2) = \{\, t \mid t \in T_1 \wedge t \notin T_2 \wedge \mathrm{setMatch}(t, P_2) \,\}$

Keep Invariants

• Keep invariants take the form
  $\mathrm{ckeep}_j(T_1, P_1, \ldots, T_n, P_n) = \check{T}_j$
  ◦ $\check{T}_j$ indicates data items held in state from input j
• Examples
  ◦ select: $\mathrm{ckeep}_1(T_1, P_1) = [\,]$
  ◦ dupelim: $\mathrm{ckeep}_1(T_1, P_1) = \mathrm{setNomatchTs}(T_1, P_1)$
  ◦ difference: $\mathrm{ckeep}_1(T_1, P_1, T_2, P_2) = [\, t \mid t \in T_1 \wedge t \notin T_2 \wedge \mathrm{setNomatch}(t, P_2) \,]$
  ◦ difference: $\mathrm{ckeep}_2(T_1, P_1, T_2, P_2) = \mathrm{setNomatchTs}(T_2, P_1)$
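The pass and keep invariants for difference can be read as executable specifications over the cumulative inputs. The sketch below is one possible reading, not the lecture's code: it assumes integer data items, a hypothetical inclusive `Range` punctuation, that `setMatch t ps` means "t matches at least one punctuation in ps", and that `setNomatch` is its negation. The names `cpassDiff`, `ckeep1Diff`, and `ckeep2Diff` are hypothetical.

```haskell
-- Assumed pattern type: an inclusive range.
newtype Punct = Range (Int, Int) deriving (Eq, Show)

-- Assumed meaning: t matches at least one punctuation in ps.
setMatch :: Int -> [Punct] -> Bool
setMatch t = any (\(Range (lo, hi)) -> lo <= t && t <= hi)

setNomatch :: Int -> [Punct] -> Bool
setNomatch t ps = not (setMatch t ps)

setNomatchTs :: [Int] -> [Punct] -> [Int]
setNomatchTs ts ps = filter (`setNomatch` ps) ts

-- Pass: t from input 1 can be output once it is absent from input 2 and
-- punctuations from input 2 guarantee it can never arrive there.
cpassDiff :: [Int] -> [Int] -> [Punct] -> [Int]
cpassDiff t1 t2 p2 = [t | t <- t1, t `notElem` t2, setMatch t p2]

-- Keep (input 1): items that are still undecided.
ckeep1Diff :: [Int] -> [Int] -> [Punct] -> [Int]
ckeep1Diff t1 t2 p2 = [t | t <- t1, t `notElem` t2, setNomatch t p2]

-- Keep (input 2): items that could still cancel a future input-1 item.
ckeep2Diff :: [Int] -> [Punct] -> [Int]
ckeep2Diff t2 p1 = setNomatchTs t2 p1
```

For example, with T1 = [1, 2, 6], T2 = [2, 7], and P2 = [Range (0,4)], only 1 can be passed: 2 appears in T2, and 6 is not yet covered by a punctuation, so 6 stays in state.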

Proof Strategy for Faithfulness

• Prove an iterator implementation is faithful and proper to its corresponding table operator
• Two-stage proof
  ◦ Step 1: Prove invariants imply faithfulness and propriety
  ◦ Step 2: Prove a particular iterator implementation conforms to invariants

Performance Scenario

• Online auction scenario discussed earlier
• Niagara query engine
  ◦ Two versions: with and without punctuation enhancements
• Generally two query plans for each query
  ◦ With and without Describe
• Seven versions of each stream
  ◦ Punctuations on: Nothing, a_id, hour, 15-minute period, 1-minute period, 30-second period, 15-second period

Performance Query 1

1. Currency Conversion

SELECT bidder, hour,
       DOLTOEUR(price)
FROM bid1;

• No punctuations required
• (Optional) Describe on no attributes: filters out all punctuations
• Indicates query overhead when punctuations not required

[Figure: query plan with projection π b,h,DOL2EUR(p) over bid]

Performance Query 3

3. Bid Counts

SELECT hour, COUNT(*)
FROM bid1
GROUP BY hour;

• Group-by operator is blocking
  ◦ Query requires help from punctuations
• Describe on the hour attribute
  ◦ “build up” on minute attribute if needed

[Figure: query plan over bid]

Performance Query 4

4. Closing Price for Auctions in Specific Categories

SELECT B.a_id, MAX(B.price)
FROM auction A, bid1 B
WHERE A.a_id=B.a_id AND
      A.category IN {92,136,208,294}
GROUP BY B.a_id;

• Join: unbounded state
• Group-by: blocking
  ◦ Requires propagation through join
• Describe on auction id
  ◦ Nothing to build up

[Figure: query plan with selection σ c IN {92,136,208,294} and join on a_id=a_id over auction and bid]