Supporting Social Science - Indiana University Bloomington


Cyberinfrastructure Supporting Social Science
Cyberinfrastructure Workshop
October 16 2012 Chicago
Geoffrey Fox
Informatics, Computing and Physics
Indiana University Bloomington
https://portal.futuregrid.org
Goal of Day
• Come up with a few (3-5) projects that
advance Social Sciences Cyberinfrastructure
• Choose so that together they cover spectrum
of characteristics
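Characteristics matrix (table): rows are Project 1, Project 2, ….., Project N; columns are characteristics A, B, C, …., Z; an X marks each characteristic that a project covers.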
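The selection goal above (a few projects that together cover the spectrum of characteristics) can be sketched as a simple coverage check. This is only an illustration: the project names, the characteristic labels, and which project covers which characteristic are all hypothetical placeholders, not taken from the workshop.

```python
# Hypothetical data: which characteristics each candidate project covers.
coverage = {
    "Project 1": {"A", "C"},
    "Project 2": {"B"},
    "Project N": {"C", "Z"},
}
# Hypothetical spectrum of characteristics the chosen projects should span.
spectrum = {"A", "B", "C", "Z"}


def uncovered(coverage, spectrum):
    """Return the characteristics that no selected project covers."""
    covered = set().union(*coverage.values())
    return spectrum - covered


def render_matrix(coverage, spectrum):
    """Render the project-by-characteristic matrix, X = covered, . = not."""
    cols = sorted(spectrum)
    width = max(len(name) for name in coverage)
    lines = [" " * width + "  " + " ".join(cols)]
    for name, traits in coverage.items():
        marks = " ".join("X" if c in traits else "." for c in cols)
        lines.append(name.ljust(width) + "  " + marks)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_matrix(coverage, spectrum))
    print("Uncovered characteristics:", sorted(uncovered(coverage, spectrum)))
```

An empty result from `uncovered` means the chosen set of projects spans every listed characteristic; any label it returns points to a gap that another project would need to fill.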
Data Type
• What is large? #Collections v. Collection Size v. #Users
• “Big (Social) Science” v Long Tail
• # rows v # columns v time dependence
• Structured (defined) v unstructured (inferred/discovered) metadata
• granularity of metadata
• Data modality: Streaming, video, image, text, “binary”
– vector space or not (genomics, network)
• distributed v centralized data (production/storage/processing)
• Complex objects v. tables
• Observed v. simulation or modeling
Data Nature (“ilities”)
• Open data
• Sharable Data
• Publication model / Data citation models?
– DOI or Handle
• Reproducibility
• Sustainability
• Standards
• Management
• Integration
• Dramatic change in next 10 years
• Data availability as in Public Windy Grid
Mining/Analyzing data
• Access: role of community comments, crowdsourcing
• Processing: “Simple” statistics, Linkage software, data
visualization, GIS, analytics (SVM, LDA, Clustering ...); (new)
management tools
• Data Mining (discovering the unexpected) v. Data Analysis
(discovering with excellence the ~expected)
• Modeling for data components and regression
• More data v more/better algorithms (in simulation, algorithm
advances ~ as important as machine advances)
• Programming model: Excel, SQL, R, SPSS, Other Scripting,
MapReduce, "Fortran/C++/Java", Libraries, workflow,
portal/gateway
• Open software & sustainability of it
Security & Privacy
• Support sharing
• The law
• Risk of identification, harm from disclosure
• Differential Privacy and nifty obfuscation ideas
• IRB
• Federated Identity
• Enclave
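The slide names differential privacy without saying which mechanism is meant. As one possible illustration, the sketch below applies the standard Laplace mechanism to a counting query. The function names, the sample data, and the choice of mechanism are assumptions made for this example, not something stated in the talk.

```python
import random


def laplace_noise(scale, rng):
    """Sample zero-mean Laplace noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Release a count with epsilon-differential privacy (Laplace mechanism).

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical survey ages, used only to exercise the function.
    ages = [23, 35, 41, 52, 67, 29, 38, 71]
    noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5)
    print(f"Noisy count of respondents aged 40+: {noisy:.2f}")
```

A smaller `epsilon` gives stronger privacy but noisier answers, which is the trade-off between supporting sharing and limiting the risk of identification listed above.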
The Infrastructure
• Repository/Archive v. Active (compute + storage) data
• Bring Computing to data
• Commercial Clouds v. XSEDE v. University
• Local v. cloud v. department/university
• Distributed (Federated) clouds as collections distributed
• DropBox, Google docs, Skype etc. v customized
• Generality of DuraCloud, Dataverse, DataUp etc.
• Tool repository/library
• Cloudbursting (public-private hybrid cloud)
• Connectivity to cloud (can be addressed by I2?)
• Backup v Main Home
Other Characteristics
• Satisfying NSF Data Management requirements
• Breadth of applicability of solutions
• # Organizations collaborating on project
• Interdisciplinary collaborations
• Data (science) Curricula
• Relation to issues in other fields
• Support and Governance
• Industry ahead of Academia