Microsoft Research Overview Talk to Homeland Security Dept


Microsoft Research and Big Databases
Information at your fingertips
Jim Gray & Tom Barclay
Microsoft Research
Presentation to US Dept. Homeland Security
7 April 2004
1
Outline
• Overview of Microsoft Research
• Big-Database Research
• TerraServer: Geospatial app
• SkyServer: data mining app
• Q&A
2
Most R&D Is D
How to Do Basic Research in Industry?
Critical questions (from Rick Rashid)
• How can I create and maintain a world-class research organization in an industrial setting?
• How do I keep the lines of communication open between product teams and researchers?
• How do I get new technology into products quickly?
4
Approach
Adapt the Academic Model
• Organizational goal: Advance state of the art
• University organizational model
– Flat structure, critical mass groups
• Open research environment
– Aggressive publication in peer-reviewed literature
– Frequent visitors, daily seminars
• Strong ties to University Research
– Nearly 15% of basic research budget directly invested in Universities
• Lab grants, research grants, fellowships, etc.
– Hundreds of interns and visitors
5
Microsoft Research
• Founded in 1991
• Staff of over 700 in over 55 areas
• Internationally recognized research teams
• Lab locations:
– Redmond, Washington, USA
– Cambridge, United Kingdom
– Beijing, People’s Republic of China
– Mountain View, California, USA
– San Francisco, California, USA
[Figure labels: 75%, 10%, 10%, 5%, 1%]
6
Microsoft Research
Expanding the State of the Art
• Thousands of peer-reviewed publications
– 10% to 30% of papers at our focus conferences: graphics, programming, systems, data management, …
• Community leadership
– Professional societies
– Journals
– Conferences
• Mentoring Interns
• Hosting academic summers and sabbaticals
• Special workshops
7
BARC’s Research Agenda
• Scalable Servers
– TerraServer – US map online
– SkyServer – All astronomy data online
• Databases
– Advancing Databases and data storage
• Media Management
– Organizing your digital shoebox
15
How Can HLS & MSR Cooperate?
• Lots of research at MSR on HLS-relevant areas:
– Data mining and visualization
– Distributed systems
– Cryptography, security, …
– Etc.
• Invite MS Researchers to HLS
– workshops
– study groups.
• HLS visiting scientists at MSR?
16
Outline
• Overview of Microsoft Research
• Big-Database Research
• TerraServer: Geospatial app
• SkyServer: data mining app
• Q&A
17
Numbers
Terabytes and Gigabytes are BIG!
• Mega – a house in California
• Giga – a very rich person (billionaire)
• Tera – ~ the national debt
• Peta – more than all the money in the world
• A Gigabyte: the Human Genome
• A Terabyte: 150 mile long shelf of books.
18
How much information is there?
• Soon everything can be recorded and indexed
• Most bytes will never be seen by humans.
• Data summarization, trend detection, anomaly detection are key technologies
See Mike Lesk: How much information is there:
http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian: How much information
http://www.sims.berkeley.edu/research/projects/how-much-info/
[Figure: byte scale Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, annotated in ascending order with A Photo, A Book, Movie, All books (words), All Books MultiMedia, Everything! Recorded]
24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli
19
e-Science
Has BIG DATA
• Data captured by instruments
Or data generated by simulator
• Processed by software
• Placed in files or a database
• Scientist analyzes files / database
• Virtual laboratories
– Networks connecting e-Scientists
– Strong support from funding agencies
• Better use of resources
– Primitive today
20
The eScience Big Picture
[Diagram labels: Experiments & Instruments, Other Archives, Literature, Simulations; facts; questions; answers]
The Big Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it
• How to coexist with others
• Query and Vis tools
• Support/training
• Performance
– Execute queries in a minute
– Batch query scheduling
21
e-Science is Data Mining
• There are LOTS of data
– people cannot examine most of it.
– Need computers to do analysis.
• Manual or Automatic Exploration
– Manual: person suggests hypothesis,
computer checks hypothesis
– Automatic: Computer suggests hypothesis
person evaluates significance
• Given an arbitrary parameter space:
– Data Clusters
– Points between Data Clusters
– Isolated Data Clusters
– Isolated Data Groups
– Holes in Data Clusters
– Isolated Points
Nichol et al. 2001
Slide courtesy of and adapted from Robert Brunner @ CalTech.
22
Data Analysis
• Looking for
– Needles in haystacks – the Higgs particle
– Haystacks: Dark matter, Dark energy
• Needles are easier than haystacks
• Global statistics have poor scaling
– Correlation functions are N^2, likelihood techniques N^3
• As data and computers grow at same rate, we can only keep up with N log N
• A way out?
– Discard notion of optimal
(data is fuzzy, answers are approximate)
– Don’t assume infinite computational resources or memory
• Requires combination of statistics & computer science
23
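The scaling point on the Data Analysis slide can be illustrated with a toy pair count. This is a sketch only, assuming 1-D integer positions and a hypothetical separation threshold `r`; it is not a real correlation-function estimator. Both functions return the same count, but the first checks every pair (N^2) while the second sorts once and uses binary search (N log N).

```python
import bisect

def count_pairs_brute(xs, r):
    """Count unordered pairs with |xi - xj| <= r by checking every pair: O(N^2)."""
    n = len(xs)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if abs(xs[i] - xs[j]) <= r
    )

def count_pairs_sorted(xs, r):
    """Same count using one sort plus a binary search per point: O(N log N)."""
    s = sorted(xs)
    total = 0
    for i, x in enumerate(s):
        # Index just past the last value that is <= x + r.
        hi = bisect.bisect_right(s, x + r)
        # Only count neighbours to the right of i so each pair is counted once.
        total += hi - i - 1
    return total

# Hypothetical sample positions, chosen only for illustration.
sample = [1, 2, 4, 7, 11]
pairs_brute = count_pairs_brute(sample, 3)
pairs_sorted = count_pairs_sorted(sample, 3)
```

The sorted version gives up nothing in accuracy here; in higher dimensions the same idea usually needs a spatial index and often an approximate answer, which is the "discard notion of optimal" point on the slide.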
Outline
• Overview of Microsoft Research
• Big-Database Research
• TerraServer: Geospatial app
• SkyServer: data mining app
• Q&A
24
TerraServer/TerraService
http://terraService.Net/
http://TerraServer-USA.com/
• US Geological Survey Photo (DOQ) & Topo (DRG) images
• On Internet since June 1998
• Operated by Microsoft
• Cross Indexed with
– Demographics,
• A web service
• 20 TB data source
• 10 M web hits/day
25
USGS Image Data
• Digital OrthoQuads
– 15 TB, 280,000 files uncompressed
– Digitized aerial imagery
– 96% coverage conterminous US
– 1 meter resolution
– < 15 years old
• Urban Area
– 1 foot resolution
– Natural Color
– 133 major U.S. cities
– 30 available 2004
– 2001 or later
– Produced by NIMA for Homeland Security
• Digital Raster Graphics
– 1 TB compressed TIFF, 65,000 files
– Scanned topo maps
– 100% U.S. coverage
– 1:24,000, 1:100,000 and 1:250,000 scale maps
– Maps vary in age
26
Image Coverage
• 100% U.S., Topo Maps (light green), 2m to 1024m resolution
• 96% 48 Conterminous States (dark green), Ortho Imagery, 1m to 1024m resolution
Urban Area Cities
Seattle, Portland, Stockton, Modesto, Fresno, Sacramento,
Chicago, Orlando, Atlanta, Amarillo, Houston, Lubbock,
Springfield, Birmingham, Dallas, Albuquerque, Oklahoma City,
El Paso, Lincoln, Lexington, Tampa, Washington DC, Mobile
Ft Wayne, Colorado Springs, Baton Rouge, …
27
User Interface Concept
Concept: User navigates an ‘almost seamless’ image of earth
Display Imagery:
• 316 M 200 x 200 pixel images
• 7 level image pyramid
• Resolution 1 meter/pixel to 64 meter/pixel
Navigation Tools:
• 1.5 M place names
• “Click-on” Coverage map
• Longitude and Latitude search
• U.S. Address Search
External Geo-Spatial Links to:
• USGS On-line Stream Gauges
• Home Advisor Demographics
• Home Advisor Real Estate
• Encarta Articles
Screenshot callouts:
• Stream flow gauges
• Click on image to zoom in
• Buttons to pan NW, N, NE, W, E, SW, S, SE
• Links to switch between Topo, Imagery, and Relief data
• Links to Print, Download and view meta-data information
28
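The pyramid numbers on the User Interface Concept slide can be worked through in a few lines. This is a rough sketch: the 1 meter/pixel base, 7 levels, and 200 x 200 pixel tiles come from the slide, while the doubling per level and the `tile_index` addressing scheme are assumptions made for illustration, not TerraServer's actual tile-id scheme.

```python
BASE_RESOLUTION_M = 1   # meters per pixel at the finest level (from the slide)
LEVELS = 7              # 7-level image pyramid (from the slide)
TILE_PIXELS = 200       # tiles are 200 x 200 pixels (from the slide)

def pyramid_levels(base=BASE_RESOLUTION_M, levels=LEVELS):
    """Meters per pixel at each level, assuming resolution doubles per level."""
    return [base * 2 ** k for k in range(levels)]

def tile_ground_size_m(resolution_m, tile_pixels=TILE_PIXELS):
    """Ground distance covered by one edge of a tile at this resolution."""
    return resolution_m * tile_pixels

def tile_index(x_m, y_m, resolution_m, tile_pixels=TILE_PIXELS):
    """Hypothetical addressing: column and row of the tile covering a point
    given in projected meters (for example UTM easting and northing)."""
    size = tile_ground_size_m(resolution_m, tile_pixels)
    return int(x_m // size), int(y_m // size)

resolutions = pyramid_levels()
coarsest_tile_edge = tile_ground_size_m(resolutions[-1])
```

Under the doubling assumption the seven levels are 1, 2, 4, 8, 16, 32 and 64 meters/pixel, which matches the range on the slide, and each zoom-out step covers four times the ground area per tile.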
New “Urban Area” Data
Microsoft Campus at 4 meter resolution
“Redundant Bunch 1”
Ball field at .25 meter resolution
29
Software Architecture
[Architecture diagram]
• Load Programs: WinForm App, C# Classes, .NET Framework 1.1, Windows 2003 Server
• Database Server: TerraServer Stored Procedures (T-SQL), SQL Server 2000, Windows 2003 Server
• Web Server: TerraServer Web Pages, Services, Classes (C#), ASP.NET 1.1, .NET Framework 1.1, IIS 6.0, Windows 2003 Server
• Data access: ADO.NET 1.1
30
TerraServer Becomes a Web Service
TerraServer.net -> TerraService.Net
• Web server is for people.
• Web Service is for programs
– The end of screen scraping
– No faking a URL: pass real parameters.
– No parsing the answer: data formatted into your address space.
• Hundreds of users but a specific example:
– US Department of Agriculture Lighthouse app.
– USDA has internal TerraServer
31
Web Service Methods
• Place Search
– GetPlaceFacts
– GetPlaceList
– GetPlaceListInRect
– CountPlacesInRect
• Projection
– ConvertLonLatPtToUtmPt
– ConvertUtmPtToLonLatPt
– ConvertLonLatToNearestPlace
– GetTheme
– GetLatLonMetrics
• Tile
– GetAreaFromPt
– GetAreaFromRect
– GetAreaFromTileId
– GetTileMetaFromLonLatPt
– GetTileMetaFromTileId
– GetTile (Image)
• Landmark
– GetLandmarkTypes
– CountOfLandmarkPointsByRect
– GetLandmarkPointsByRect
– CountOfLandmarkShapesByRect
– GetLandmarkShapesByRect
http://terraservice.net
32
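To give a feel for what the projection methods deal with, here is one small step of a longitude/latitude to UTM conversion: picking the UTM zone. This is a local sketch, not the web service itself. The function names are hypothetical, the full easting/northing math is omitted, and the special-case zones around Norway and Svalbard are ignored.

```python
import math

def utm_zone(lon_deg: float) -> int:
    """Standard 6-degree UTM zone number (1..60) for a longitude in [-180, 180)."""
    if not -180.0 <= lon_deg < 180.0:
        raise ValueError("longitude must be in [-180, 180)")
    return int(math.floor((lon_deg + 180.0) / 6.0)) + 1

def utm_central_meridian(zone: int) -> float:
    """Longitude (degrees) of the central meridian of a UTM zone."""
    return zone * 6.0 - 183.0

# Approximate longitudes, used only as illustrative inputs.
seattle_zone = utm_zone(-122.33)
washington_dc_zone = utm_zone(-77.04)
```

A client of a real projection service would pass the point as parameters and get the zone, easting and northing back as structured data, which is the "no parsing the answer" point made earlier in the talk.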
TerraServer Web Services
Terra-Tile-Service
• Get image meta-data
• Query TS Gazetteer
• Retrieve TS ImageTiles
• Projection conversions
Landmark-Service
• Geo-coded data of well-known objects (points), e.g. Schools, Golf Courses, Hospitals, etc.
• Polygons of well-known objects (shapes), e.g. Zip Codes, Cities, etc.
Sample Apps
• Web Map Client
– OpenGIS “like”
– Landmarks layered on TerraServer imagery
• Fat Map Client
– Visual Basic / C# Windows Form
– Access Web Services for all data
http://terraservice.net
33
Hardware Evolution
• 1998 – 2000: DEC Alpha 8400, StorageWorks DAS
– 1 x 8 x 440 MHz RISC processor, 2 GB RAM
– 2.5 TB RAID-5, 9 GB SCSI drives, 7 racks
– $2.1M (World’s Largest PC) – “Single Server Scale Up”
• 2000 – 2003: 4-node Compaq Windows 2000 DataCenter Cluster, StorageWorks SAN
– 4 x 8 x 700 MHz Intel (Xeon) Processor, 4 GB RAM each
– 18 TB RAID-10 (triple mirrored), 73 GB drives, 4 racks
– $1.6M – “High Availability Large Scale Cluster”
• 2004 - …: “White-box Storage Bricks”
– Low Cost Availability
• 4 copies of the data
– RAID1 SATA Mirroring
– 2 redundant “Bunches”
• Spare brick to repair failed brick (2N+1 design)
• KVM / IP
• Web Application “bunch aware”
– Load balances between redundant databases
– Fails over to surviving database on failure
– ~$100K capital expense.
37
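The “bunch aware” behavior described on the Hardware Evolution slide (load balance across redundant databases, fail over to the survivor) can be sketched as follows. This is a minimal illustration, not the TerraServer code: the class name, the round-robin policy, and the use of `ConnectionError` to signal a failed bunch are all assumptions.

```python
import itertools

class BunchAwareClient:
    """Round-robin over redundant database 'bunches', skipping failed ones."""

    def __init__(self, bunches):
        # bunches: mapping of name -> callable(query) that returns a result
        # or raises ConnectionError when that bunch is unavailable.
        self._bunches = dict(bunches)
        self._order = itertools.cycle(list(self._bunches))

    def query(self, q):
        errors = []
        # Try each bunch at most once per request, starting from the next
        # one in round-robin order so healthy bunches share the load.
        for _ in range(len(self._bunches)):
            name = next(self._order)
            try:
                return name, self._bunches[name](q)
            except ConnectionError as exc:
                errors.append((name, str(exc)))
        raise ConnectionError(f"all bunches failed: {errors}")

# Hypothetical stand-ins for two database bunches.
def healthy(q):
    return f"tile for {q}"

def down(q):
    raise ConnectionError("bunch offline")

balanced = BunchAwareClient({"bunch1": healthy, "bunch2": healthy})
served_by = [balanced.query("x")[0] for _ in range(4)]

degraded = BunchAwareClient({"bunch1": down, "bunch2": healthy})
survivor = [degraded.query("x")[0] for _ in range(3)]
```

Keeping the failover logic in the web application, rather than in a shared-storage cluster, is what lets each bunch be an independent set of cheap bricks.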
Outline
• Overview of Microsoft Research
• Big-Database Research
• TerraServer: Geospatial app
• SkyServer: data mining app
• Q&A
38
Virtual Observatory
http://www.astro.caltech.edu/nvoconf/
http://www.voforum.org/
• Premise: Most data is (or could be) online
• So, the Internet is the world’s best telescope:
– It has data on every part of the sky
– In every measured spectral band: optical, x-ray, radio, …
– As deep as the best instruments (2 years ago).
– It is up when you are up. The “seeing” is always great (no working at night, no clouds, no moons, no …).
– It’s a smart telescope: links objects and data to literature on them.
39
Why Astronomy Data?
• It has no commercial value
– No privacy concerns
– Can freely share results with others
– Great for experimenting with algorithms
• It is real and well documented
– High-dimensional data (with confidence intervals)
– Spatial data
– Temporal data
• Many different instruments from many different places and many different times
• Federation is a goal
• The questions are interesting
– How did the universe form?
• There is a lot of it (petabytes)
[Image panels: ROSAT ~keV, DSS Optical, 2MASS 2μm, IRAS 25μm, IRAS 100μm, GB 6cm, NVSS 20cm, WENSS 92cm]
40
Time and Spectral Dimensions
The Multiwavelength Crab Nebula
Crab star, 1054 AD
X-ray, optical, infrared, and radio views of the nearby Crab Nebula, which is now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese astronomers.
Slide courtesy of Robert Brunner @ CalTech.
41
SkyServer.SDSS.org
• A modern archive
– Raw Pixel data lives in file servers
– Catalog data (derived objects) lives in Database
– Online query to any and all
• Also used for education
– 150 hours of online Astronomy
– Implicitly teaches data analysis
• Interesting things
– Spatial data search
– Client query interface via Java Applet
– Query interface via Emacs
– Popular -- 1% of TerraServer
– Cloned by other surveys (a template design)
– Web services are core of it.
42
Demo of SkyServer
• Shows standard web server
• Pixel/image data
• Point and click
• Explore one object
• Explore sets of objects (data mining)
43
Data Federations of Web Services
• Massive datasets live near their owners:
– Near the instrument’s software pipeline
– Near the applications
– Near data knowledge and curation
– Super Computer centers become Super Data Centers
• Each Archive publishes a web service
– Schema: documents the data
– Methods on objects (queries)
• Scientists get “personalized” extracts
• Uniform access to multiple Archives (Federation)
– A common global schema
44
Federation: SkyQuery.Net
• Combine 4 archives initially
• Just added 10 more
• Send query to portal, portal joins data from archives.
• Problem: want to do multi-step data analysis (not just single query).
• Solution: Allow personal databases on portal
• Problem: some queries are monsters
• Solution: “batch schedule” on portal server, deposits answer in personal database.
45
SkyQuery Structure
• Each SkyNode publishes
– Schema Web Service
– Database Web Service
• Portal
– Plans Query (2 phase)
– Integrates answers
– Is itself a web service
[Diagram labels: SkyQuery Portal, Image Cutout, SDSS, INT, FIRST, 2MASS]
46
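The portal's role (plan the query in two phases, then integrate answers from the SkyNodes) might look roughly like this. It is a toy sketch under stated assumptions: nodes are in-memory lists rather than web services, positions are 1-D, and the "count first, then visit the smallest archive first" plan is an assumed reading of "2 phase", not a description of the real SkyQuery planner. All function names are hypothetical.

```python
def make_node(objects):
    """Hypothetical SkyNode: exposes a count method and a fetch method."""
    def fetch(region):
        lo, hi = region
        return [o for o in objects if lo <= o["pos"] <= hi]
    return {"fetch": fetch, "count": lambda region: len(fetch(region))}

def plan(nodes, region):
    """Phase 1: ask every node for a count in the region; smallest first."""
    counts = {name: node["count"](region) for name, node in nodes.items()}
    return sorted(counts, key=counts.get)

def execute(nodes, region, match):
    """Phase 2: visit nodes in plan order, keeping only cross-matched tuples."""
    order = plan(nodes, region)
    partial = [(obj,) for obj in nodes[order[0]]["fetch"](region)]
    for name in order[1:]:
        candidates = nodes[name]["fetch"](region)
        partial = [t + (c,) for t in partial for c in candidates if match(t[-1], c)]
    return order, partial

# Made-up objects for illustration only.
nodes = {
    "SDSS": make_node([
        {"id": "s1", "pos": 1.0},
        {"id": "s2", "pos": 2.0},
        {"id": "s3", "pos": 9.0},
    ]),
    "TWOMASS": make_node([{"id": "t1", "pos": 1.01}]),
}
order, rows = execute(nodes, (0.0, 5.0), lambda a, b: abs(a["pos"] - b["pos"]) < 0.05)
matched_ids = [tuple(o["id"] for o in row) for row in rows]
```

Starting from the archive with the fewest candidates keeps the intermediate result small before it is joined against the larger archives.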
SkyQuery: http://skyquery.net/
• Distributed Query tool using a set of web services
• Four astronomy archives from
Pasadena, Chicago, Baltimore, Cambridge (England).
• Feasibility study, built in 6 weeks
– Tanu Malik (JHU CS grad student)
– Tamas Budavari (JHU astro postdoc)
– With help from Szalay, Thakar, Gray
• Implemented in C# and .NET
• Allows queries like:
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o,
TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t)<3.5
AND AREA(181.3,-0.76,6.5)
AND o.type=3 and (o.I - t.m_j)>2
47
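The join in the query above can be approximated locally with a brute-force cross-match. This is a simplified stand-in: it treats the threshold as a plain angular distance in arcseconds, which is an assumption and may not be what XMATCH actually computes, and the sample rows and the `crossmatch` helper are hypothetical.

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula); degrees in, arcseconds out."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 3600

def crossmatch(left, right, max_sep_arcsec):
    """Brute-force join: (left objId, right objId) pairs closer than the threshold."""
    return [
        (l["objId"], r["objId"])
        for l in left
        for r in right
        if angular_sep_arcsec(l["ra"], l["dec"], r["ra"], r["dec"]) < max_sep_arcsec
    ]

# Made-up rows standing in for the two PhotoPrimary tables.
sdss = [{"objId": 1, "ra": 181.3, "dec": -0.76}]
twomass = [
    {"objId": 10, "ra": 181.3, "dec": -0.76 + 1 / 3600},   # about 1 arcsec away
    {"objId": 11, "ra": 181.3, "dec": -0.76 + 10 / 3600},  # about 10 arcsec away
]
matches = crossmatch(sdss, twomass, 3.5)
```

The double loop is N x M comparisons, which is fine for a sketch; a real archive would use a spatial index so each object is only compared against nearby candidates.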
SkyNode Basic Web Services
• Metadata information about resources
– Waveband
– Sky coverage
– Translation of names to universal dictionary (UCD)
• Simple search patterns on the resources
– Cone Search
– Image mosaic
– Unit conversions
• Simple filtering, counting, histogramming
• On-the-fly recalibrations
48
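A Cone Search, one of the simple search patterns listed above, returns the objects within a given angular radius of a sky position. The sketch below shows one common way to do it, comparing unit vectors by dot product; the function names and the sample catalog are hypothetical, and a real SkyNode would run this against an indexed database rather than a Python list.

```python
import math

def unit_vector(ra_deg, dec_deg):
    """Cartesian unit vector for a sky position given in degrees."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra), math.cos(dec) * math.sin(ra), math.sin(dec))

def cone_search(objects, ra_deg, dec_deg, radius_deg):
    """Return objIds of objects within radius_deg of (ra_deg, dec_deg)."""
    center = unit_vector(ra_deg, dec_deg)
    # Two directions are within the radius when their dot product is at
    # least the cosine of the radius.
    min_dot = math.cos(math.radians(radius_deg))
    hits = []
    for obj in objects:
        v = unit_vector(obj["ra"], obj["dec"])
        if sum(a * b for a, b in zip(center, v)) >= min_dot:
            hits.append(obj["objId"])
    return hits

# Made-up catalog rows for illustration.
catalog = [
    {"objId": 1, "ra": 181.3, "dec": -0.76},
    {"objId": 2, "ra": 181.4, "dec": -0.76},
    {"objId": 3, "ra": 190.0, "dec": 10.0},
]
narrow = cone_search(catalog, 181.3, -0.76, 0.05)
wide = cone_search(catalog, 181.3, -0.76, 0.2)
```

Storing the unit vector with each object turns the test into three multiplications and a comparison, with no trigonometry per row at query time.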
Portals: Higher Level Services
• Built on Atomic Services
• Perform more complex tasks
• Examples
– Automated resource discovery
– Cross-identifications
– Photometric redshifts
– Outlier detections
– Visualization facilities
• Goal:
– Build custom portals in days from existing building blocks (like today in IRAF or IDL)
49
MyDB added to SkyQuery
• Moves analysis to the data
• Users can cooperate (share MyDB)
• Still exploring this
• Let users add personal DB, 1GB for now.
• Use it as a workbook.
• Online and batch queries.
[Diagram labels: SkyQuery Portal, MyDB, Image Cutout, SDSS, INT, FIRST, 2MASS]
50
The Big Picture
[Diagram labels: Experiments & Instruments, Other Archives, Literature, Simulations; facts; questions; answers]
The Big Problems
• Data ingest
• Managing a petabyte
• Common schema
• How to organize it?
• How to reorganize it
• How to coexist with others
• Query and Vis tools
• Support/training
• Performance
– Execute queries in a minute
– Batch query scheduling
51
Outline
• Overview of Microsoft Research
• Big-Database Research
• TerraServer: Geospatial app
• SkyServer: data mining app
• Q&A
52
Grid and Web Services Synergy
• I believe the Grid will be many web services that share data (computrons are free)
• IETF standards provide
– Naming
– Authorization / Security / Privacy
– Distributed Objects: Discovery, Definition, Invocation, Object Model
– Higher level services: workflow, transactions, DB, …
• Synergy: commercial Internet & Grid tools
53