Giles_Talk - NSF PI Meeting | The Science of Cloud Computing
Research Issues for Large Scale Digital Library Search Engines in the Cloud: CiteSeerX
or
Why Consider CiteSeerX as a Cloud Testbed?
C. Lee Giles, Pradeep Teregowda, Bhuvan Urgaonkar
Pennsylvania State University
University Park, PA
Data Varies with Discipline
or Small vs Big Science
“Data from Big Science is … easier to handle, understand and archive.
Small Science is horribly heterogeneous and far more vast. In time
Small Science will generate 2-3 times more data than Big Science.”
‘Lost in a Sea of Science Data’ S.Carlson, The Chronicle of Higher
Education (23/06/2006)
Data is local
Data will not be shared
At some point, “local” clouds will be needed
If you can’t move the data around, take the analysis/cloud to the data!
• Bandwidth of a van loaded with disks
Do all/most data manipulations locally
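A back-of-the-envelope sketch of the "van loaded with disks" argument. The function names, data sizes, link speed and driving time are illustrative assumptions, not figures from the talk.

```python
def transfer_days(size_tb: float, link_mbps: float) -> float:
    """Days needed to push size_tb (decimal terabytes) through a link_mbps link."""
    bits = size_tb * 1e12 * 8              # terabytes -> bits
    seconds = bits / (link_mbps * 1e6)     # Mbit/s -> bit/s
    return seconds / 86_400                # seconds -> days


def van_effective_mbps(size_tb: float, drive_hours: float) -> float:
    """Effective throughput of driving size_tb of disks for drive_hours."""
    bits = size_tb * 1e12 * 8
    return bits / (drive_hours * 3600) / 1e6


if __name__ == "__main__":
    # Hypothetical inputs: 2 TB over a 100 Mbit/s link; 100 TB driven for 24 hours.
    print(f"network: {transfer_days(2, 100):.2f} days")
    print(f"van: {van_effective_mbps(100, 24):.0f} Mbit/s effective")
```

Under these assumed numbers the van comfortably outruns the network link, which is the case for moving the analysis to the data rather than the reverse.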
Clouds for digital libraries/search engines
Several features are attractive for information retrieval systems such as digital libraries and search engines (storage and fast access)
Flexibility/growth
• Components such as crawlers, web interfaces, etc. can utilize resources on demand.
Management
• Utilizing cloud services potentially requires less investment in hardware and maintenance.
Reliability
• By deploying across sites (or adopting distribution services provided by vendors), systems are potentially more stable.
What about CiteSeerX?
SeerSuite - CiteSeerX
SeerSuite
Framework for digital libraries
Flexible, scalable, robust, portable, state of the art machine
learning extractors, open source.
Easy to create instances of SeerSuite both production
and research grade:
CiteSeerx: computer science
ChemXSeer: chemistry
ArchSeer: archaeology
CollabSeer
EnronSeer
YouSeer
Facilitates research
http://citeseerx.ist.psu.edu
How CiteSeerX is like a specialty search engine
CiteSeerX shares several components with digital libraries and search engines
Web Interface
• Both digital libraries and search engines provide interfaces through which users interact with the application.
Crawlers
• Focused crawling of the web for scholarly/academic documents.
Index
• Both digital libraries and search engines utilize an inverted index to provide efficient and fast access to users through search.
Databases
• Digital libraries maintain extensive metadata, usually in relational databases.
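A minimal sketch of the inverted index idea mentioned above, assuming a toy in-memory collection. The function names and sample documents are hypothetical, and this is not how SeerSuite implements its index.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {  # hypothetical documents
        "d1": "Cloud hosting for digital libraries",
        "d2": "Focused crawling of scholarly documents",
        "d3": "Digital libraries and search engines",
    }
    print(sorted(search(build_inverted_index(docs), "digital libraries")))
```

Lookups touch only the posting sets of the query terms instead of scanning every document, which is what makes search over a large collection fast.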
CiteSeerX as a testbed
Digital Libraries continue to grow and be widely used
Cyberinfrastructure for scientists and academics
Google Scholar is very popular & to some invaluable
Publisher collections: ACM portal, Scopus, etc.; Library of Congress (NDLP)
DLs are usually poorly supported and have few monetization models
CiteSeerX is a digital library and a search engine
Features of CiteSeerX
Automatic acquisition of new documents by focused web crawling (1.5M documents, 20M
citations – 2TB, 1-4 M authors) (data regularly shared by rsync), 24/7 service
Interface for search widely used (2 M hits/day, 200K queries/day)
Full text indexing
Autonomous citation indexing, linking documents through citations.
Automatic metadata extraction for each document.
MyCiteSeer for personalization.
New features in development, e.g.
Table extraction and search
Algorithm extraction and search
Commercial grade open source code and data shared
4 systems:
• Production
• Crawling
• Staging
• Research
All or some can be cloudized
Collection of Research Issues
Hosting cloud CiteSeerX instances
Economic issues
Cost of hosting
Cost of refactoring the source to be hosted in the cloud.
Computational/technical issues
What workflow to cloudize
Component modification for efficient operation
VM size: storage, memory and CPU sizing as a function of
needs
Establishing computational needs and availability clusters
Appropriate load balancing across multiple sites.
Security of data stored including metadata and user data.
Policy issues
Privacy of user data
Copyright issues.
CiteSeerX Architecture (USENIX '10)
[Architecture diagram: Web Application, Focused Crawler, Document Conversion and Extraction, Document Ingestion, Data Storage, Maintenance Services, Federated Services]
CiteSeerX data transfer (IEEE Cloud '10)
[Diagram: nodes for hosting]
Hosting models for DLs
Component hosting
USENIX Hotcloud ‘10, IEEE Cloud ‘10
SeerSuite is modular by design and architecture; host individual components across available infrastructure.
Content hosting
CiteSeerx provides access to document metadata,
copies and application content
Host parts or complete set.
Peak load hosting
Support the application during peak loads
Support growth of traffic.
Focus on actual public cloud costs
Google App Engine, EC2 estimates
Component Hosting
The expense of hosting the whole of CiteSeerx may be prohibitive.
Solution: host a component or service, i.e.,
Component/service code
Data on which the component acts
Interfaces, etc. associated with the component
Goal: Identify optimal subset/components based on:
Service growth
Service usage
New services
Component Hosting - Costs
Component       Amazon EC2                   Google App Engine
                Initial ($)   Monthly ($)    Initial ($)   Monthly ($)
Web Services    0             1448.18        0             942.53
Repository      0             1000           163.8         593.21
Database        0             858.89         12            348.05
Index           0             527.08         3.1           83.48
Extraction      0             499.02         0             90.6
Crawler         0             513.4          0             105
Least expensive option - host the index for cases.
Most expensive - host web services.
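A small sketch for checking the cheapest and most expensive components against the monthly figures in the table above. The column assignment (Amazon EC2 first, Google App Engine second) is an assumption read off the flattened header order, and the function name is hypothetical.

```python
# Monthly costs transcribed from the table above, as (Amazon EC2, Google App Engine).
# ASSUMPTION: column order follows the header order in the table (EC2 first).
MONTHLY_COSTS = {
    "Web Services": (1448.18, 942.53),
    "Repository": (1000.00, 593.21),
    "Database": (858.89, 348.05),
    "Index": (527.08, 83.48),
    "Extraction": (499.02, 90.60),
    "Crawler": (513.40, 105.00),
}
PROVIDERS = ("Amazon EC2", "Google App Engine")


def cheapest_and_priciest(provider: str) -> tuple[str, str]:
    """Return (cheapest, most expensive) component by monthly cost."""
    col = PROVIDERS.index(provider)
    by_cost = sorted(MONTHLY_COSTS, key=lambda name: MONTHLY_COSTS[name][col])
    return by_cost[0], by_cost[-1]


if __name__ == "__main__":
    for provider in PROVIDERS:
        low, high = cheapest_and_priciest(provider)
        print(f"{provider}: cheapest={low}, most expensive={high}")
```

Under this reading of the columns, web services are the most expensive component on both platforms. The index is the cheapest monthly option on Google App Engine, while on Amazon EC2 extraction comes out slightly lower than the index.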
Component Hosting – Lessons Learned
Hosting components is reasonable
Having a service oriented architecture helps
Amazon EC2
Computation costs dominate.
Google App Engine
Refactoring costs?
Refactoring required not just for components, but other services.
Storage and transfer costs may be optimized
A study of data transfer in the application gives insights to costs.
Approach suitable for meeting fixed budgets
How many components of an application can be hosted for a fixed budget?
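One way to sketch the fixed-budget question: pick components cheapest-first until the monthly budget runs out, which maximizes the number of components hosted. The function name and the budget value are hypothetical. The costs in the usage example are the Google App Engine monthly figures from the cost table, under the assumption that those are the second pair of columns.

```python
def components_within_budget(monthly_costs: dict[str, float],
                             budget: float) -> list[str]:
    """Greedy cheapest-first selection.

    Maximizes the *count* of hosted components. It does not weigh how
    valuable each component is, which a real policy would likely need to do.
    """
    chosen: list[str] = []
    spent = 0.0
    for name, cost in sorted(monthly_costs.items(), key=lambda kv: kv[1]):
        if spent + cost > budget:
            break  # every remaining component costs at least as much
        chosen.append(name)
        spent += cost
    return chosen


if __name__ == "__main__":
    app_engine_monthly = {  # transcribed from the cost table (assumed columns)
        "Web Services": 942.53,
        "Repository": 593.21,
        "Database": 348.05,
        "Index": 83.48,
        "Extraction": 90.60,
        "Crawler": 105.00,
    }
    # Hypothetical budget of $300 per month.
    print(components_within_budget(app_engine_monthly, 300.0))
```

With the assumed $300 budget this selects the index, extraction and crawler components. Initial (one-time) costs are ignored here and would need to be added for a fuller estimate.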
Content Hosting – Lessons Learned
Hosting specific content relevant to peak load
scenarios
Easy to do – minimal refactoring required, affects a
minimal set of components (presentation layer).
More complex scenarios need to be examined
Hosting papers from the repository
Hosting shards of the index
Database
Peak Load – Lessons Learned
Hosting only during peak load conditions is
economically feasible.
Growth potential
Can be used to handle growth in traffic, instead of
procuring new hardware.
Hosting a specific component under stress, such as a database
In such a case it will cost $400 to host the database in Amazon EC2.
Research Directions
Similar to many discussed at this workshop, applied to DLs
Explore policies for hosting based on
Privacy/Security
Integrity/reliability
QoS; $ Costs
Local vs public
Architecture redesign utilizing cloud primitives and
systems spanning multiple sites
Queues, Key Stores, Clusters
Optimization of existing features for automated
VM-ing
Conclusions
Advantages of cloudizing CiteSeerx
Reliability, maintenance, potential cost savings
Different costs of hosting for all or parts of
Components
Content
Peak load
CiteSeerX working system – testbed?
Data, storage, access, databases
Growth, evolving features, users
$ savings depend on continued support; working
local system may still be needed
Archival issues/ Google Scholar
NSF cloud focus?
Besides what has been proposed at this workshop
Clouds for science – what industry will not or cannot support
Both for big and small science
Clouds for the “new” sciences – social, political,
historical,… that have growing amounts of data
Also focus on data:
Data rules. Without data, there is no science.
Future Work
Cost of refactoring – particularly for Google App
Engine.
Cost comparisons for other cloud offerings – Azure,
Eucalyptus.
Privacy and user issues – myCiteSeer and private
clouds.
Technical issues with cross hosting – load balancing and latency need to be addressed.
Virtualization in SeerSuite, components built with cloud
hosting in mind (Federated Services).
References
Grossman, R., and Gu, Y. Data mining using high performance data clouds: Experimental studies using Sector and Sphere. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008), ACM, pp. 920–927.
Mika, P., and Tummarello, G. Web semantics in the clouds. IEEE Intelligent Systems 23, 5 (2008), 82–87.
Nurmi, D., Wolski, R., Grzegorczyk, C., Obertelli, G., Soman, S., Youseff, L., and Zagorodnov, D. The Eucalyptus open-source cloud-computing system. In Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid-Volume 00 (2009), IEEE Computer Society, pp. 124–131.
Singh, A., Srivatsa, M., and Liu, L. Search-as-a-service: Outsourced search over outsourced storage. ACM Trans. Web 3, 4 (2009), 1–33.
Teregowda, P., Urgaonkar, B., and Giles, C. Cloud computing: A digital libraries perspective. In 3rd IEEE 2010 International Conference on Cloud Computing (2010).
Teregowda, P. B., Councill, I. G., Fernandez, J. P. R., Khabsa, M., Zheng, S., and Giles, L. C. SeerSuite: Developing a scalable and reliable application framework for building digital libraries by crawling the web. In USENIX Conference on Web Application Development (2010).
Teregowda, P. B., Urgaonkar, B., and Giles, C. L. Cost implications of moving to the cloud: A digital libraries perspective. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '10) (2010).
Van de Sompel, H., Nelson, M., Lagoze, C., and Warner, S. Resource harvesting within the OAI-PMH framework. D-Lib Magazine 10, 12 (2004), 1082–9873.
Walker, E., Brisken, W., and Romney, J. To lease or not to lease from storage clouds. Computer 43 (2010), 44–50.
Weigel, F., Panda, B., Riedewald, M., Gehrke, J., and Calimlim, M. Large-scale collaborative analysis and extraction of web data. Proc. VLDB Endow. 1, 2 (2008), 1476–1479.
Wood, T., Cecchet, E., Ramakrishnan, K., Shenoy, P., van der Merwe, J., and Venkataramani, A. Disaster recovery as a cloud service: Economic benefits & deployment challenges. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (2010).