Apartment Cloud

Noah Callaway
Zak Nelson
Zac Fleischmann
Brandon Zahl
Aspirations / Reality
Aspirations:
• Aggregate apartment listings from all across the internet to create a…
• …simple, one-stop apartment search
Reality:
• Aggregate apartment listings from top sites (Washington state only)
• …mostly one-stop apartment search
• …mostly simple
Building It
Brandon – Site-specific extractors, Statistics
Noah – Server configuration, Front-end development
Zac – Site-specific extractors, Advanced Search
Zak – Crawler / Aggregator, Commute distance feature
Page Extraction Statistics
Extractor Name        Files Crawled  Listings Found  Extraction Errors  % error-free
Rent.com                       4907             325                  0         100
ApartmentRatings.com           7855             723                 11          98.4
Craigslist.com                 9794            2773                 91          96.6
MyNewPlace.com                 7392             901                 70          91.6
Extraction Accuracy Statistics
Extractor Name         TP   TN   FP   FN  Precision  Recall  F-score
Rent.com              281  135   12    0      0.959   1.000    0.979
ApartmentRatings.com   39    0    1    0      0.975   1.000    0.987
Craigslist             63  147    3    9      0.955   0.875    0.913
MyNewPlace.com        186  186   10   44      0.949   0.809    0.873
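The precision, recall, and F-score columns follow from the TP, FP, and FN counts under the standard definitions. The sketch below recomputes them from the counts in the table; the names `Counts`, `COUNTS`, `precision`, `recall`, and `f_score` are illustrative and not taken from the project's code.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion counts for one extractor (TN is not needed for these metrics)."""
    tp: int
    fp: int
    fn: int


def precision(c: Counts) -> float:
    # Share of extracted listings that were correct.
    total = c.tp + c.fp
    return c.tp / total if total else 0.0


def recall(c: Counts) -> float:
    # Share of real listings that the extractor found.
    total = c.tp + c.fn
    return c.tp / total if total else 0.0


def f_score(c: Counts) -> float:
    # Harmonic mean of precision and recall (F1).
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if p + r else 0.0


# Counts copied from the Extraction Accuracy Statistics table.
COUNTS = {
    "Rent.com": Counts(tp=281, fp=12, fn=0),
    "ApartmentRatings.com": Counts(tp=39, fp=1, fn=0),
    "Craigslist": Counts(tp=63, fp=3, fn=9),
    "MyNewPlace.com": Counts(tp=186, fp=10, fn=44),
}

if __name__ == "__main__":
    for name, c in COUNTS.items():
        print(f"{name:22s} {precision(c):.3f} {recall(c):.3f} {f_score(c):.3f}")
```

Rounded to three decimals, these values agree with the figures reported in the table.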
Experiment Conclusion
• Much higher accuracy on the structured pages versus unstructured Craigslist
• Craigslist is a candidate for machine learning
• Machine learning would likely do worse on the other sites
What we learned
• How to configure Amazon Web Services with a LAMP stack
• How to create a web application with AJAX
• How to use Jobo and Nutch for web crawling
• How to parse HTML for pertinent data
• The considerations of starting a web business
Unexpected Outcomes
• Amazon Web Services was slower than a $7/month virtual server
• Most of the large listing sites were surprisingly easy to extract data from
• Aggregating information from the web is legally tricky
Things We’d Do Differently
• Better version control
• More pre-coding design
• More quality control and testing
• More extensible extractors (maybe built on an existing HTML parser)
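As one possible reading of "more extensible extractors", the sketch below builds a small field extractor on Python's standard `html.parser` module, so that supporting a new site mostly means supplying a list of class names. It is not the project's extractor. The markup in `SAMPLE_HTML`, the `listing`, `price`, and `beds` class names, and the `ClassTextParser` and `extract` names are all hypothetical.

```python
from html.parser import HTMLParser


class ClassTextParser(HTMLParser):
    """Collects the text of elements whose class is in `wanted`, grouped per listing."""

    def __init__(self, wanted, listing_class="listing"):
        super().__init__()
        self.wanted = set(wanted)
        self.listing_class = listing_class
        self.current = None  # field whose text we are waiting for
        self.records = []    # one dict per listing found

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.listing_class in classes:
            self.records.append({})  # a new listing starts here
        for name in classes:
            if name in self.wanted and self.records:
                self.current = name

    def handle_data(self, data):
        text = data.strip()
        if self.current and text:
            self.records[-1][self.current] = text
            self.current = None

    def handle_endtag(self, tag):
        # Stop waiting if the field element closed without any text.
        self.current = None


def extract(html, fields):
    """Return a list of {field: text} dicts, one per listing in `html`."""
    parser = ClassTextParser(fields)
    parser.feed(html)
    parser.close()
    return parser.records


# Hypothetical markup; real listing sites are structured differently.
SAMPLE_HTML = """
<div class="listing"><span class="price">$950</span><span class="beds">2</span></div>
<div class="listing"><span class="price">$1200</span><span class="beds">3</span></div>
"""

if __name__ == "__main__":
    for record in extract(SAMPLE_HTML, ["price", "beds"]):
        print(record)
```

With this shape, the per-site differences live in the field list (and the listing class) rather than in hand-written parsing code, which is the kind of extensibility the bullet above points at.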
Demo