Giles_Talk - NSF PI Meeting | The Science of Cloud Computing

Research Issues for Large Scale Digital Library Search Engines in the Cloud: CiteSeerX
or
Why Consider CiteSeerX as a Cloud Testbed
C. Lee Giles, Pradeep Teregowda, Bhuvan Urgaonkar
Pennsylvania State University
University Park, PA
Data Varies with Discipline
or Small vs Big Science
“Data from Big Science is … easier to handle, understand and archive.
Small Science is horribly heterogeneous and far more vast. In time
Small Science will generate 2-3 times more data than Big Science.”
‘Lost in a Sea of Science Data’, S. Carlson, The Chronicle of Higher Education (23/06/2006)
Data is local
Data will not be shared
At some point “local” clouds will be needed
If you can’t move the data around, take the analysis/cloud to the data!
• Bandwidth of a van loaded with disks
Do all/most data manipulations locally
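The "van loaded with disks" point can be made concrete with a little arithmetic. The sketch below is illustrative only: the 2 TB figure is the CiteSeerX collection size quoted later in this talk, while the link speed, the van's payload, and the trip time are hypothetical values chosen for the example.

```python
def network_transfer_hours(data_tb, link_mbps):
    """Hours needed to push data_tb terabytes over a link of link_mbps megabits/s."""
    bits = data_tb * 1e12 * 8          # decimal terabytes to bits
    seconds = bits / (link_mbps * 1e6)
    return seconds / 3600


def van_effective_mbps(data_tb, trip_hours):
    """Effective bandwidth (megabits/s) of physically shipping data_tb terabytes."""
    bits = data_tb * 1e12 * 8
    return bits / (trip_hours * 3600) / 1e6


# Hypothetical scenario: a 100 Mbit/s link versus a van carrying 100 TB for 10 hours.
hours_over_network = network_transfer_hours(2, 100)
van_mbps = van_effective_mbps(100, 10)
print(f"2 TB over 100 Mbit/s: {hours_over_network:.1f} hours")
print(f"Van with 100 TB, 10 hour trip: {van_mbps:.0f} Mbit/s effective")
```

Under these assumed numbers the 2 TB copy takes about 44 hours over the network, while the van delivers roughly 22 Gbit/s of effective bandwidth, which is the intuition behind taking the analysis to the data.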
Clouds for digital libraries/search engines
Several features are attractive for information retrieval systems such as digital libraries and search engines (storage and fast access)
Flexibility/growth
• Components such as crawlers, web interfaces, etc. can utilize resources on demand.
Management
• Utilizing cloud services potentially requires less investment in hardware and maintenance.
Reliability
• By deploying across sites (or adopting distribution services provided by vendors), systems are potentially more stable.
What about CiteSeerX?
SeerSuite - CiteSeerX
SeerSuite
• Framework for digital libraries
• Flexible, scalable, robust, portable, state of the art machine learning extractors, open source.
• Easy to create instances of SeerSuite, both production and research grade:
  • CiteSeerX: computer science
  • ChemXSeer: chemistry
  • ArchSeer: archaeology
  • CollabSeer
  • EnronSeer
  • YouSeer
• Facilitates research
http://citeseerx.ist.psu.edu
How CiteSeerX is like a specialty search engine
CiteSeerX shares several components with digital libraries and search engines
Web Interface
• Both digital libraries and search engines provide interfaces through which users interact with the application
Crawlers
• Focused crawling of the web for scholarly/academic documents
Index
• Both digital libraries and search engines utilize an inverted index to give users efficient and fast access through search.
Databases
• Digital libraries maintain extensive metadata, usually in relational databases.
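As a reminder of what the shared "inverted index" component does, here is a minimal sketch. It is not SeerSuite code; the function names and sample documents are made up for illustration, and tokenization is plain whitespace splitting.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents
docs = {
    1: "cloud computing for digital libraries",
    2: "focused crawling of scholarly documents",
    3: "digital libraries and search engines",
}
index = build_inverted_index(docs)
print(search(index, "digital libraries"))
```

A query only touches the posting sets of its own terms rather than scanning every document, which is why the index gives fast access as the collection grows.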
CiteSeerX as a testbed
• Digital Libraries continue to grow and be widely used
• Cyberinfrastructure for scientists and academics
• Google Scholar is very popular & to some invaluable
• Publisher collections: ACM portal, Scopus, etc.; Library of Congress (NDLP)
• DLs are usually poorly supported and have few monetization models
• CiteSeerX is a digital library and a search engine
Features of CiteSeerX
• Automatic acquisition of new documents by focused web crawling (1.5M documents, 20M citations – 2TB, 1-4 M authors) (data regularly shared by rsync), 24/7 service
• Interface for search widely used (2 M hits/day, 200K queries/day)
• Full text indexing
• Autonomous citation indexing, linking documents through citations.
• Automatic metadata extraction for each document.
• MyCiteSeer for personalization.
• New features in development, e.g.
  • Table extraction and search
  • Algorithm extraction and search
Commercial grade open source code and data shared
4 systems:
• Production
• Crawling
• Staging
• Research
All or some can be cloudized
Collection of Research Issues
Hosting cloud CiteSeerX instances
Economic issues
• Cost of hosting
• Cost of refactoring the source to be hosted in the cloud.
Computational/technical issues
• What workflow to cloudize
• Component modification for efficient operation
• VM size: storage, memory and CPU sizing as a function of needs
• Establishing computational needs and availability clusters
• Appropriate load balancing across multiple sites.
• Security of data stored including metadata and user data.
Policy issues
• Privacy of user data
• Copyright issues.
CiteSeerX Architecture
USENIX ‘10
• Web Application
• Focused Crawler
• Document Conversion and Extraction
• Document Ingestion
• Data Storage
• Maintenance Services
• Federated Services
CiteSeerX data transfer
 Nodes for hosting
IEEE Cloud ‘10
Hosting models for DLs
Component hosting
USENIX HotCloud ‘10, IEEE Cloud ‘10
• SeerSuite is modular by design and architecture; host individual components across available infrastructure.
Content hosting
• CiteSeerX provides access to document metadata, copies and application content
• Host parts or the complete set.
Peak load hosting
• Support the application during peak loads
• Support growth of traffic.
Focus on actual public cloud costs
• Google App Engine, EC2 estimates
Component Hosting
• Expense of hosting the whole of CiteSeerX may be prohibitive.
• Solution: host a component or service, i.e.,
  • Component/service code
  • Data on which the component acts
  • Interfaces, etc. associated with the component
• Goal: identify the optimal subset of components based on:
  • Service growth
  • Service usage
  • New services
Component Hosting - Costs
Component       EC2 Initial   EC2 Monthly Costs   App Engine Initial   App Engine Monthly Costs
Web Services    0             1448.18             0                    942.53
Repository      0             1000                163.8                593.21
Database        0             858.89              12                   348.05
Index           0             527.08              3.1                  83.48
Extraction      0             499.02              0                    90.6
Crawler         0             513.4               0                    105
• Least expensive option - host the index for cases.
• Most expensive - host web services.
Component Hosting – Lessons Learned
Hosting components is reasonable
• Having a service oriented architecture helps
Amazon EC2
• Computation costs dominate.
Google App Engine
• Refactoring costs?
Refactoring required not just for components, but also for other services.
Storage and transfer costs may be optimized
• A study of data transfer in the application gives insight into costs.
Approach suitable for meeting fixed budgets
• How many components of an application can be hosted for a fixed budget?
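The fixed-budget question can be sketched as a simple selection problem. The example below is a hedged illustration, not the method used in the cited papers: it assumes the second numeric column of the "Component Hosting - Costs" slide is the Amazon EC2 monthly cost, ignores initial costs, and uses a hypothetical helper name and budget. Picking the cheapest components first maximizes how many fit.

```python
# Monthly costs read from the costs slide, assuming the second value in each
# row is the Amazon EC2 monthly figure (this column reading is an assumption).
EC2_MONTHLY = {
    "Web Services": 1448.18,
    "Repository": 1000.00,
    "Database": 858.89,
    "Index": 527.08,
    "Extraction": 499.02,
    "Crawler": 513.40,
}


def components_within_budget(monthly_costs, budget):
    """Greedily pick the cheapest components until the monthly budget runs out.

    Returns the chosen component names (cheapest first) and the total spent.
    """
    chosen = []
    remaining = budget
    for name, cost in sorted(monthly_costs.items(), key=lambda kv: kv[1]):
        if cost > remaining:
            break  # every later component costs at least as much
        chosen.append(name)
        remaining -= cost
    return chosen, round(budget - remaining, 2)


# Hypothetical monthly budget of 1600
chosen, spent = components_within_budget(EC2_MONTHLY, 1600.0)
print(chosen, spent)
```

This maximizes the number of hosted components only; a real decision would also weigh service growth, usage, and the data transfer between hosted and local components.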
Content Hosting – Lessons Learned
Hosting specific content relevant to peak load scenarios
• Easy to do – minimal refactoring required, affects a minimal set of components (presentation layer).
More complex scenarios need to be examined
• Hosting papers from the repository
• Hosting shards of the index
• Database
Peak Load – Lessons Learned
Hosting only during peak load conditions is economically feasible.
Growth potential
• Can be used to handle growth in traffic, instead of procuring new hardware.
• Hosting a specific component under stress, such as a database
• In such a case it will cost $400 to host the database in Amazon EC2.
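Why peak-only hosting can be economical follows from paying only for the hours used. The sketch below is a rough model under stated assumptions: the hourly rate, peak window, and flat transfer charge are hypothetical parameters, not figures from the talk or from any provider's price list.

```python
def always_on_monthly_cost(hourly_rate, days=30):
    """Cost of keeping an instance running around the clock for a month."""
    return hourly_rate * 24 * days


def peak_only_monthly_cost(hourly_rate, peak_hours_per_day, days=30, transfer_cost=0.0):
    """Cost of running an instance only during daily peak hours, plus a flat
    (hypothetical) data transfer charge for keeping it in sync."""
    return hourly_rate * peak_hours_per_day * days + transfer_cost


# Hypothetical values: 0.68 per instance-hour, 6 peak hours per day
rate = 0.68
full = always_on_monthly_cost(rate)
peak = peak_only_monthly_cost(rate, peak_hours_per_day=6)
print(f"always on: {full:.2f}, peak only: {peak:.2f}")
```

With these assumed values the peak-only instance costs a quarter of the always-on one; the saving shrinks as the peak window widens or as synchronization traffic grows.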
Research Directions
• Similar to many discussed at this workshop, applied to DLs
• Explore policies for hosting based on
  • Privacy/Security
  • Integrity/reliability
  • QoS; $ Costs
  • Local vs public
• Architecture redesign utilizing cloud primitives and systems spanning multiple sites
  • Queues, Key Stores, Clusters
• Optimization of existing features for automated VM-ing
Conclusions
Advantages of cloudizing CiteSeerX
• Reliability, maintenance, potential cost savings
• Different costs of hosting for all or parts of
  • Components
  • Content
  • Peak load
• CiteSeerX working system – testbed?
  • Data, storage, access, databases
  • Growth, evolving features, users
• $ savings depend on continued support; working local system may still be needed
• Archival issues/ Google Scholar
NSF cloud focus?
Besides what has been proposed at this workshop
• Clouds for science – what industry will not or cannot support
  • Both for big and small science
• Clouds for the “new” sciences – social, political, historical,… that have growing amounts of data
• Also focus on data:
  • Data rules. Without data, there is no science.
Future Work
• Cost of refactoring – particularly for Google App Engine.
• Cost comparisons for other cloud offerings – Azure, Eucalyptus.
• Privacy and user issues – myCiteSeer and private clouds.
• Technical issues with cross hosting – load balancing, latency need to be addressed.
• Virtualization in SeerSuite, components built with cloud hosting in mind (Federated Services).
References
Grossman, R., and Gu, Y. Data mining using high performance data clouds: Experimental studies using Sector and Sphere. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008), ACM, pp. 920–927.
Mika, P., and Tummarello, G. Web semantics in the clouds. IEEE Intelligent Systems 23, 5 (2008), 82–87.
Nurmi, D., Wolski, R., Grzegorczyk, C., Obertelli, G., Soman, S., Youseff, L., and Zagorodnov, D. The Eucalyptus open-source cloud-computing system. In Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, Volume 00 (2009), IEEE Computer Society, pp. 124–131.
Singh, A., Srivatsa, M., and Liu, L. Search-as-a-service: Outsourced search over outsourced storage. ACM Trans. Web 3, 4 (2009), 1–33.
Teregowda, P., Urgaonkar, B., and Giles, C. Cloud computing: A digital libraries perspective. In 3rd IEEE 2010 International Conference on Cloud Computing (2010).
Teregowda, P. B., Councill, I. G., Fernandez, J. P. R., Kasbha, M., Zheng, S., and Giles, L. C. SeerSuite: Developing a scalable and reliable application framework for building digital libraries by crawling the web. In USENIX Conference on Web Application Development (2010).
Teregowda, P. B., Urgaonkar, B., and Giles, C. L. Cost implications of moving to the cloud: A digital libraries perspective. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '10) (2010).
Van de Sompel, H., Nelson, M., Lagoze, C., and Warner, S. Resource harvesting within the OAI-PMH framework. D-Lib Magazine 10, 12 (2004), 1082–9873.
Walker, E., Brisken, W., and Romney, J. To lease or not to lease from storage clouds. Computer 43 (2010), 44–50.
Weigel, F., Panda, B., Riedewald, M., Gehrke, J., and Calimlim, M. Large-scale collaborative analysis and extraction of web data. Proc. VLDB Endow. 1, 2 (2008), 1476–1479.
Wood, T., Cecchet, E., Ramakrishnan, K., Shenoy, P., Van der Merwe, J., and Venkataramani, A. Disaster recovery as a cloud service: Economic benefits & deployment challenges. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (2010).