Special Interest Activity

Download Report

Transcript Special Interest Activity

Special Interest Activity
Trio – A System for Integrated Management of Data,
Uncertainty, and Lineage
ITK478
Yan Cui
Agenda











Understand Trio new database system
System requirements of installing Trio system
Source codes/packages for Trio system
Procedure of Trio system installation
Configuration of Trio System
Experiment Trio DBMS Using TrioExplorer and
TrioPlus (demo)
Trio API Integrated Into Other Python Scripts
Trio API and Translator (Python)
Trio Query Language (TriQL) Structure (demo)
Advantage and Disadvantage of using Trio DBMS
Documenting Trio system report/bugs
Understand Trio new database system

Basic concept of new Trio DBMS
◦ Trio is a new kind of database system (DBMS), which was
developed by Stanford University Lab at Dec, 2006.
◦ An extended relational model called Uncertainty Lineage
Database (ULDB) , and also supports Trio’s query language
Called TriQL [1].
◦ Handles structured data, uncertainty of data, and data lineage
together in a fully integrated manner.
Trio System architecture (components)
 Uncertainty-Lineage Database (ULDB)
 TriQL: The Trio Query Language

Trio System architecture

Four primary components of Trio DBMS
◦
◦
◦
◦

command-line client (TrioPlus)
TrioExplorer
Trio API and translator (Python)
standard relational DBMS (PostgreSQL)
Uncertainty-Lineage of Database
◦
◦
◦
◦
Encoded data table
Lineage table
Trio metadata
Trio Stored Procedures
Uncertainty-Lineage Database (ULDB)





Alternatives
‘?’ (Maybe) Annotations
Numerical Confidences
Lineage
Sample: Drives (person, color, car) and Saw (witness,
color, car) uncertainty tables with/no with confidence.
Uncertainty-Lineage Database (ULDB)

Alternatives
◦ Definition: Alternatives are presenting uncertainty about the
contents of a tuple [2]
◦ ‘||’ annotation
◦ Drives(person, color, car) and Saw (witness, color, car)
uncertainty tables
◦ ‘Select * from Drives”
Uncertainty-Lineage Database (ULDB)

‘?’ (Maybe) Annotations
◦ Definition: ‘?’ annotation present the existence of a tuple on the
x-tuple, also called maybe x-tuple [2]
◦ Drives(person, color, car) and Saw (witness, color, car)
uncertainty tables
◦ ‘select * from Drives’
Uncertainty-Lineage Database (ULDB)

Numerical Confidences
◦ Definition: Numerical confidence also was considered as
probability [2].
◦ Drives(person, color, car) and Saw (witness, color, car)
uncertainty tables with confidence
◦ ‘select * from Drives’
Uncertainty-Lineage Database (ULDB)

Lineage
◦ Definition: “recorded at the granularity of tuple alternatives:
Lineage connects as x-tuple alternative to other x-tuple
alternative.” in [2]
◦ Drives(person, color, car) and Saw (witness, color, car)
uncertainty tables with confidence
◦ ‘select person from Drives’
TriQL: The Trio Query Language

Two major parts of TriQL
◦ built-in functions and predicates for querying confidence values
and lineage
◦ regular SQL syntax
System requirements of installation

Operating systems - as Linux, Mac OS X, and Win-32 (XP,Vista,
and 32-bit Server)
PostgreSQL database (version 8.2.5, 8.1.10, 8.0.14, and
7.4.18) - Linux and Win32
 Python API - windows, Linux/Unix, Mac OS X, OS/2, and Amiga

Source codes/packages for Trio system
Listing source codes












Python 2.4 can be downloaded from http://www.python.org/ .
Easy_install can be downloaded from
http://peak.telecommunity.com/DevCenter/EasyInstall and the file
called ez_setup.py.
Readline 1.7.win32 can be downloaded from http://www.python.org/ .
ctypes-1.0.2.win32-py2.4 can be download from
http://www.python.org/ .
PostgreSQL 8.1 can be downloaded from http://www.postgresql.org/
Graphviz 2.14 is the only version compatible with Trio API. It is
available in http://infolab.stanford.edu/trio/code/graphviz-2.14.1.exe .
PyGreSQL can be downloaded from http://www.pygresql.org/ .
Pylons 0.9.5 can be downloaded from http://pylonshq.com/ .
PLY 2.2 can be downloaded from http://www.dabeaz.com/ply/ .
PyParsing can be downloaded from http://pyparsing.wikispaces.com/ .
PyDot can be downloaded from
http://code.google.com/p/pydot/downloads/list .
Trio API 1.0 can be downloaded from
http://infolab.stanford.edu/~theobald/sources/TRIO.zip .
Procedure of Trio system installation

Python
◦ Python 2.4 windows version (python-2.4.4.msi).
◦ Install Python in C:/Python directory
◦ Set path=c:/Python24; in environment variables

Readline
◦ Readline-1.7.win32-py2.4.exe and install into Python directory

Ctypes
◦ Ctypes-1.0.2.win32-py2.4.exe and install into Python directory

PostgreSQL
◦ PostgreSQL 8.1 windows version (postgresql-8.1.msi )
◦ Install PostgreSQL 8.1 and set path C:\Program
Files\PostgreSQL\8.1\bin; after completed the installation.

Graphviz
◦ Graphviz 2.14 version, install to your workstation and set path
C:\PROGRA~1\ATT\Graphviz\bin; in environment variables after
completed the installation
Procedure of Trio system installation (cont)







Easy_install - Download ez_setup.py in C:/ directory
PyGreSQL – In command line, cd\ to c: directory, and run python
ez_setup.py PyGreSQL to install components.
Pylons – In command line, cd\ to c: directory, and run python
ez_setup.py Pylons==0.9.5 to install Pylons. Set path
c:\python24\Scripts in environment variables.
PLY – In command line, cd\ to c: directory, and run python
ez_setup.py Ply==2.2 or easy_install Ply==2.2.
PyParsing – In command line, cd\ to c: directory, and run python
ez_setup.py PyParsing.
PyDot – download the source from website. Access to folder in
command line, and then install manually by running ‘python
setup.py install’.
Trio API
◦ Download source code in any directory
◦ Copy Trio-1.0\spi\triospi_win32.dll to PostgreSQL’s lib directory
and renamed as triospi.dll
Procedure of Trio system installation (cont)
PostgreSQL installation
Configuration of Trio System
Windows superuser authentication to access
PostgreSQL
 TrioExplorer
 TrioPlus

Configuration of Trio System

Windows superuser authentication for PostgreSQL
◦ Open pgadmin III for PostgreSQL
◦ Create new login role
 Role name ‘myname’ (as same as windows login account)
 Set password (Password can be any) and check all role privileges and
click ok.
◦ Create new database
 Database name ‘myname’ (as same as username)
 Owner is ‘myname’ and click ok
◦ Initialize Trio schema information
 In Trio-1.0\setup, open setup.py with notepad to comment out the
last three codes and put the following.
os.system("psql %s %s < setup.sql" % (pgdbname, username))
os.system("psql %s %s < setup_triospi.sql" % (pgdbname, username))
os.system("psql %s %s < trio_get_conf.sql" % (pgdbname, username))
 Save the file, and at the command line, cd \Trio-1.0\setup, and run
‘python setup.py myname myname’
Configuration of Trio System (cont)

pgadmin III for PostgreSQL 8.1
Configuration of Trio System (cont)

TrioExplorer
◦ Make sure PostgreSQL is working.
◦ Running TrioExplore – Ensure path ‘c:\python24\Scripts;’ in
environment variables. And double click ‘start_te_server.bat’
under Trio-1.0\explorer.
◦ At the command line, you are now prompted for an admin user
login to PostgreSQL, which should have been created along with
your PostgreSQL installation and which will be used by
TrioExplorer to create new user roles and database instances.
◦ TrioExplorer should now be reachable from your browser using
http://localhost:8080/. For new users can now press ‘Create a
new user’ and create their own Trio login and database instances,
which are then managed by the PostgreSQL server.
Configuration of Trio System (cont)

TrioPlus
◦ Create new PostgreSQL user role and database instance
 Run ‘createuser demo’
 Run ‘createdb demo’, the name must be the same as username
◦ Initialize Trio schema information for new user by access as same
as windows superuser authentication to access PostgreSQL. Use
TrioExplorer will be easily just press ‘Create new role’ in Web.
◦ Connect to new Trio database using the command line clients by
running ‘python trioplus.py –u demo –d demo –p’
Trio API Integrated Into Other Python
Scripts

TrioCnx
◦
◦
◦
◦
◦

TrioCnx(pgdb) –PyGreSQL connection pgdb
cursor() – return a new TrioCursor object for the current connection.
commint() – commits the current transaction.
rollback() – performs a rollback for the current transaction.
close() – closes the Trio connection
TrioCursor
◦ execute(triql) - Executes a TriQL statement triql for this cursor
object.
◦ fetchone() - Fetches a single XTuple object from the current cursor
position.
◦ fetchall() - Fetches and returns a list of all XTuple objects beginning
from the current cursor position.
Trio API Integrated Into Other Python
Scripts (cont)

XTuple
◦ len() - Returns the number of Alternative objects contained in this
XTuple object.
◦ getAlternative(idx) - Returns the Alternative object at the designated
index idx.
◦ getConfidence() - Returns the confidence value (if any) of this XTuple
object as the sum of its Alternative objects' confidence values.
◦ getQuestionMark() - Returns whether this XTuple object has a
question mark or not.

Alternative
 getLineage() – returns a list of immediate lineage information
 traceLineage() – performs a transitive lineage traversal for this
alternative back to the base data
 getConfidence() - Returns the confidence value of this alternative.
 computeConfidence() - Computes the confidence value
Trio API And Translator (Python)
TriQL built-in functions
TriQL keywords
Aggregate(), Alternative(), As(), Avg(),
BinaryExpression(), Brackets(), Cascade(),
ColumnList(), Column(), Command(),
ComputeConfidences(), Conf(), ConfUniform(),
ConfScaled(), Count(), CreateIndex(),
CreateTableAs(), CreateTable(),
CreateTempTableAs(), CreateTempTable(),
DataType(), Dot(), DropIndex(), DropTableList(),
DropTable(), Eavg(), Ecount(), EcountStar(),
Emax(),
Emin(), Empty(), Esum(), FromClause(),
GroupAlts(), GroupbyClause(), GroupByKey(),
HavingClause(), HorizontalSelect(),
IdentifierList(), InsertList(), Insert(),
InsertSpec(), Lineage(), Literal(), Lsum(), Max(),
Maybe(), Maybe(), Min(), OrderBy(),
Question(), SelectClause(), Select(),
SelectOptions(), SetOperator(), Star(), Sum(),
UnaryExpression(), UncertainSet(),
UniformTable(), ViewTable(), WhereClause(),
WithConfidences()
'all', 'and', 'any', 'as', 'avg', 'by', 'cascade',
‘compute', 'conf', 'confidences', 'count',
'create', 'distinct', 'drop', 'eavg', 'ecount',
'emax', 'emin', 'esum', 'except', 'exists',
‘flatten', 'float', ‘float4', 'float8', 'from',
'group', 'groupalts', 'having',
'in', 'index', 'insert', 'int', 'int4', 'int8',
'intersect', 'into', 'is', 'like', 'lineage', #'lavg',
#'lcount', #'lmax', #'lmin', #'lsum', 'max',
'maybe', 'min', 'noconf', 'nolineage',
'nomaybe', 'not', 'null', 'on', 'or', 'order',
'scaled', 'scaledbyexp', 'select', 'sum', 'table',
'temporary', 'trio', 'trio_aid', 'trio_xid',
'uncertain', 'uniform', 'union', 'merged',
'values', 'varchar', 'view', 'where', 'with'
Demo
TrioExplorer and TrioPlus
 Trio Query Language (TriQL) Structure

◦
◦
◦
◦
◦
Drop index/table
Create Trio table/index
TriQL language
TriQL manual: http://infolab.stanford.edu/~widom/triql.html
Sample: TriQL Script
Current Supported for TriQL
Supported DDL and Insert Commands

Create Table

Create Trio Table

Create Index

Drop Table Including “Cascade” option to drop
derived tables
•
Drop Index
•
Insert Into T Values <data>
o With or without alternatives,
confidence values, and ?
o T must be a base table
•
Insert Into T <TriQL query>
o T must be a base table
Supported Subset of TriQL Query Language

Select-Project-Join queries
•
Create Table T as <query>

Select Distinct
•
Union, Intersect, or Except of two subqueries (duplicateeliminating)
•
Union All of two subqueries
•
Merged, Flatten, and GroupAlts

Horizontal subqueries in Where clause
o
Including shortcuts and aggregation/group-by/having
•
Horizontal subqueries in Select clause
o Including shortcuts and aggregation but not groupby/having

Conf() function with any number of tables or Conf(*)
o
except for Conf(*) in the SELECT clause in conjunction
with DISTINCT or MERGED

Maybe() predicate

“Uniform <table-name>” in From clause
•
“as conf” for query-defined result confidence values
o Including “Uniform as conf” and “Scaled as conf”
•
“Compute Confidences” at end of query
•
Lineage() predicate
o Not in horizontal subqueries
o “=⇒” abbreviation allowed
•
NoLineage, NoConf, and NoMaybe
Table of TriQL contents
ULDBs
SQL over ULDBs
Flatten and GroupAlts
Horizontal subqeries: The [ ] construct
Built-in Functions Conf() and Maybe()
Uncertainty attributes, maybe annotations and
confidence values
Selection, projection, join, subqueries, duplicateelimination, grouping and aggregation, aggregate
variants, set operators, order by
Flatten is used to turn tuples with alternative
values into regular tuples, while GroupAlts is
used to create or restructure alternative values
[ ] in the where clause, [ ] with joins, Syntactic
shortcuts in [ ], [ ] in the select clause, [ ] with
Self-Joins
Multi-table conf()
Result confidences
Result confidence evaluation, uniform and scaled
result confidences, On-Demand confidence
computation
Built-in Predicate Lineage()
The Lineage() predicate lets queries filter joined
tuples based on whether they are related via
lineage
Options Nolineage, Noconf, and NoMaybe
Indicate lineage, confidence values, and/or?'s
should be omitted from query results
Insert statement, delete statement, update
statement
Data modification
Advantage and Disadvantage

Advantage
◦ Open source and free support for any non-benefit users to
experience new Trio DBMS
◦ Advanced components in relational DBMS
◦ Computing confidences
◦ Efficient, Convenient, safe, Multi-User storage of and access to,
Massive, Persistent

Disadvantage
◦ Time cost for query
◦ Dependency
◦ On development stage
Advantage and Disadvantage

Disadvantage
◦ Time cost for query
Using ‘SELECT attr-list FROM X1, X2, ..., Xn WHERE predicate’ as a
query example in [6] for a comparison between relational database
and ULDB.
Over standard relational database:
 For each tuple in cross-product of X1, X2, ..., Xn
 Evaluate the predicate
 If true, project attr-list to create result tuple
Over ULDB:
 For each tuple in cross-product of X1, X2, ..., Xn
 Create “super tuple” T from all combinations of alternatives
 Evaluate predicate on each alternative in T ; keep only the true
ones
 Project attr-list on each alternative to create result tuple
 Details: ‘?’, lineage, confidences
Documenting Trio system report/bugs




The install instruction in website
http://dbpubs.stanford.edu:8011/doku.php/trio:installation ,
indicated unclearly the version of Graphviz for Trio system. In
Graphviz website http://www.graphviz.org/ only has version 2.16
but not compatable except version 2.14. Graphviz version 2.14 is
available for download in
http://infolab.stanford.edu/trio/code/graphviz-2.14.1.exe .
The windows authentication supperuser needs to be created first
in the PostgreSQL in order to connect to database. After
established the connection, TrioExplorer and TrioPlus can use the
supperuser’s login and password as windows authentiction to
access to database system. However, it doesn’t mention at all in the
installation procedure on how to create this typle of new user. The
only way to solve it is to use PostgreSQL->pgadmin III manually.
After created the supperuser, I have to modify some codes in
setup.py in Trio-1.0->setup directory in order to run ‘python
setup.py –u myname –d myname –p’.
TriQL query statements in
http://infolab.stanford.edu/~widom/triql.html#options, there are
many samples queries not working properly as desired.
References






M. Mutsuzaki, M. Theobald, A. de Keijzer, J. Widom, P. Agrawal, O.
Benjelloun, A. Das Sarma, R. Murthy, and T. Sugihara. Trio-One:
Layering Uncertainty and Lineage on a Conventional DBMS.
Proceedings of the Third Biennial Conference on Innovative Data
Systems Research (CIDR '07), Pacific Grove, California, January
2007. Demonstration description.
O. Benjelloun, A. Das Sarma, C. Hayworth, and J. Widom. An
Introduction to ULDBs and the Trio System. IEEE Data Engineering
Bulletin, Special Issue on Probabilistic Databases, 29(1):5-16, March
2006.
Trio: A System for integrated Management of Data, Uncertainty, and
Lineage. Retrieved on November, 18, 2007 from
http://infolab.stanford.edu/trio/ .
PostgreSQL. Retrieved on November, 20, 2007 from
http://www.postgresql.org/.
Python. Retrieved on November, 15, 2007 from
http://www.python.org/.
Trio: A System for Data, Uncertainty, and Lineage. given by Jennifer
at various venues, 2006-07. Ppt.
Questions?