Data Warehousing from the Web
Chris Fernandes ([email protected]) and Michael Whalen ([email protected])
Department of Computer Science, Union College, Schenectady, NY, 12308
Summary
Procedure
Results
Warehouse projects provide students
with robust capstone experiences and
produce interesting results.
Meeting scheduler
Course analysis
Enrollment data
Room availability
Introduction
Data warehousing is the process of collecting information
from various repositories and combining it into a single
structured repository that can be queried for new
information such as performance trends. Many web sites
contain useful but unstructured data, which makes them ideal
sources for student projects on data warehousing.
We describe one such project, developed from the
registration web pages at Union College, which allows
faculty and students to get on-line access to course
enrollment trends, classroom availability, and other
pertinent information. The project surfaced so much
information that the administration became concerned about
student privacy, which allowed the student to extend his
work into the area of warehouse security.
HTML data is parsed automatically each night
and its content is transferred to the warehouse
back end, called SCOUR (Search Contents
Of Union’s Registry).
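The nightly parse-and-load step could be sketched roughly as follows. This is only an illustration using Python's standard library: the HTML fragment, the course numbers, the `RowParser` and `load_courses` names, and the three-column layout are all assumptions, since the poster does not show the registrar markup or the SCOUR loader itself.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page fragment; the real registrar markup is not shown in the poster.
SAMPLE_HTML = """
<table>
<tr><td>CSC-010</td><td>Intro to Computer Science</td><td>Fall</td></tr>
<tr><td>CSC-140</td><td>Data Structures</td><td>Fall</td></tr>
</table>
"""


class RowParser(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            # Keep only rows with the three cells this sketch expects.
            if len(self._row) == 3:
                self.rows.append(tuple(cell.strip() for cell in self._row))
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


def load_courses(html, conn):
    """Parse course rows out of the page and upsert them into the warehouse."""
    parser = RowParser()
    parser.feed(html)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Course "
        "(Number TEXT, Name TEXT, Term TEXT, PRIMARY KEY (Number, Term))"
    )
    conn.executemany("INSERT OR REPLACE INTO Course VALUES (?, ?, ?)", parser.rows)
    conn.commit()
    return len(parser.rows)


conn = sqlite3.connect(":memory:")  # a real warehouse would use a persistent store
loaded = load_courses(SAMPLE_HTML, conn)
print(loaded)
```

Keying on (Number, Term) lets a nightly rerun replace the current term's rows while leaving earlier terms in place.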
Course table: Number, Name, Homepage, Term, …
Query results can be displayed in a variety
of formats, including histograms, and can be
exported to a spreadsheet.
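As a rough illustration of these two output formats, the sketch below renders a query result as a text histogram and as CSV that a spreadsheet can open. The result rows and the `to_csv` and `text_histogram` helpers are hypothetical; the poster does not describe how SCOUR produces its displays.

```python
import csv
import io

# Hypothetical query result: (term, enrollment) pairs.
rows = [("Fall", 24), ("Winter", 18), ("Spring", 30)]


def to_csv(rows, header):
    """Serialize a result set as CSV text for import into a spreadsheet."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def text_histogram(rows, scale=2):
    """Draw one bar per row; each '#' stands for `scale` units."""
    lines = []
    for label, count in rows:
        lines.append(f"{label:<8}{'#' * (count // scale)} {count}")
    return "\n".join(lines)


print(to_csv(rows, ("Term", "Enrollment")))
print(text_histogram(rows))
```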
Event table: BeginTime, EndTime, Location, …
User table: Fname, Lname, AdvisorID, …
Unlike a traditional operational database, the SCOUR
warehouse contains historical and
summarized data for use in statistical queries.
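A statistical query over historical data might look like the sketch below. The `EnrollmentHistory` table, its columns, the placeholder term labels, and the numbers are assumptions made for illustration; the poster does not give SCOUR's actual schema for enrollment history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical history table: one row per course per term, kept across terms.
conn.execute(
    "CREATE TABLE EnrollmentHistory (Number TEXT, Term TEXT, Enrolled INTEGER)"
)
conn.executemany(
    "INSERT INTO EnrollmentHistory VALUES (?, ?, ?)",
    [
        ("CSC-010", "T1", 20),
        ("CSC-010", "T2", 26),
        ("CSC-010", "T3", 32),
        ("CSC-140", "T1", 15),
        ("CSC-140", "T2", 13),
    ],
)

# Summarize each course over all recorded terms.
summary = conn.execute(
    """
    SELECT Number,
           COUNT(*)                      AS Terms,
           AVG(Enrolled)                 AS AvgEnrolled,
           MAX(Enrolled) - MIN(Enrolled) AS Spread
    FROM EnrollmentHistory
    GROUP BY Number
    ORDER BY Number
    """
).fetchall()

for row in summary:
    print(row)
```

Because rows from past terms are never overwritten, aggregates such as the average and the spread can be computed directly, which a database holding only the current term could not support.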
Conclusions
1. Data warehousing projects yield many pedagogical
benefits. They allow students to build bridges
between many areas of computer science, including:
• database and data warehouse theory
• GUI design
• security and authorization
• interface usability
• privacy and ethics
2. Projects can be diverse. Raw data abounds on the
Web in many fields.
3. Projects can have flexible scope to meet time
constraints. One can easily extend a warehouse
with security to restrict access to sensitive queries.
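One simple way to restrict access to sensitive queries is a role check in front of the query dispatcher. The sketch below assumes hypothetical role names, query names, and helper functions (`may_run`, `run_query`); it is not the security design used in the project, which the poster does not describe.

```python
# Hypothetical query names and roles, for illustration only.
SENSITIVE_QUERIES = {"student_schedule"}
ROLES_ALLOWED_SENSITIVE = {"advisor", "registrar"}


def may_run(role, query_name):
    """Non-sensitive queries are open to all; sensitive ones need a trusted role."""
    if query_name not in SENSITIVE_QUERIES:
        return True
    return role in ROLES_ALLOWED_SENSITIVE


def run_query(role, query_name, queries):
    """Dispatch a named query only if the caller's role permits it."""
    if not may_run(role, query_name):
        raise PermissionError(f"role {role!r} may not run {query_name!r}")
    return queries[query_name]()


# Stand-in query implementations that return fixed strings.
queries = {
    "room_availability": lambda: "open to everyone",
    "student_schedule": lambda: "restricted result",
}

print(run_query("student", "room_availability", queries))
print(run_query("advisor", "student_schedule", queries))
```

Keeping the check in one function means a new sensitive query only has to be added to one set.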
Union registrar web pages
contain semistructured data.
Dynamic web-based front ends were created for
a variety of queries. It was essential to maintain
ease of use by non-technical operators.