Building a Site: Publicising


IDK0040 Võrgurakendused I
Building a site: Publicising
Deniss Kumlander
How to make a site really public
or
search engines’ algorithms
Targets
• We are going to check
– how search services are built
– how to ensure that your site is included in
the search engines' databases
– how to improve a site's chances of being
selected by a search engine in response to a
query string
High level description
[Diagram: a crawler scans the web and fills a database of URLs, ranks, relations etc.; the searcher's query processor works against that database]
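The three parts of the diagram can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the PAGES dictionary stands in for the web, the URLs are made up, and the function names crawl, build_index and search are hypothetical, not taken from any real search engine.

```python
from collections import defaultdict

# Hypothetical in-memory "web": URL -> page text (stands in for fetched pages).
PAGES = {
    "http://example.com/a": "search engines build an index",
    "http://example.com/b": "a crawler scans pages",
    "http://example.com/c": "the index maps words to pages",
}

def crawl(pages):
    """Yield (url, text) pairs; a real crawler would follow links over HTTP."""
    for url, text in pages.items():
        yield url, text

def build_index(crawled):
    """Build the database: each word maps to the set of URLs containing it."""
    index = defaultdict(set)
    for url, text in crawled:
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return URLs containing every query word; only the index is consulted."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

INDEX = build_index(crawl(PAGES))
```

Note that search never touches PAGES: the query is answered from the index alone, which is why a page that has not been crawled cannot be found.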
High level description
• Note that a search engine answers a searched string either with
a general-level description of the web
• or with a private view
• Either way, you are not searching the Internet, but an
index created by a “search engine”
– Simply storing one billion pages of 10 kbytes each (compressed) requires 10 TB,
and another 10 TB for indexes
– Moreover, a public search engine requires many more resources than this to calculate
query results and to provide high availability.
– Crawling 1B pages with 10 machines, each crawling at 100 pages/second, would take
1M seconds, or 11.6 days, on a very high capacity Internet connection.
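The figures above can be checked with a short calculation. The sketch assumes that 1 kbyte is 1000 bytes and that each of the 10 machines crawls 100 pages per second; the constant names are chosen here for illustration.

```python
PAGES = 1_000_000_000              # one billion pages
PAGE_SIZE_BYTES = 10_000           # 10 kbytes per compressed page (1 kbyte = 1000 bytes)
MACHINES = 10
PAGES_PER_SECOND_PER_MACHINE = 100

storage_tb = PAGES * PAGE_SIZE_BYTES / 1e12                        # bytes -> terabytes
crawl_seconds = PAGES / (MACHINES * PAGES_PER_SECOND_PER_MACHINE)
crawl_days = crawl_seconds / 86_400                                # seconds per day

print(f"storage: {storage_tb:.0f} TB")
print(f"crawl time: {crawl_seconds:.0f} s = {crawl_days:.1f} days")
```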
Adding a site
• Wait until a site that is already indexed links to
yours
– The Web is growing much faster than any present-technology
search engine can possibly index (see distributed web crawling).
In 2006, some users found that major search engines had become slower
to index new web pages
• Use the “Add my URL” function that many search
engines offer
– A website developer has to be more proactive than ever before
about getting listed by search engines and directories. In many
cases, this means (unfortunately) that you have to pay a fee to
get listed.
Searching process
• All words are ranked by how prevalent they are
in the standard language X: the rarer the word, the more it matters
– “Rude” is a more important word than “all”
– Common words (“and”, “or”) are thrown away
unless you are looking for an exact phrase
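One common way to express these two rules in code is a stop-word list plus a weight that grows as a word becomes rarer. The sketch below is an assumption about how this could look, not a description of any particular engine: the frequency numbers are made up, and STOP_WORDS, query_terms and weight are hypothetical names.

```python
import math

# Hypothetical relative word frequencies in the language (made-up numbers).
WORD_FREQUENCY = {"all": 0.004, "rude": 0.00001, "and": 0.03, "or": 0.004}
STOP_WORDS = {"and", "or"}      # common words dropped from ordinary queries
DEFAULT_FREQUENCY = 0.000001    # assumed value for words missing from the table

def query_terms(query, exact_phrase=False):
    """Split a query into terms, keeping stop words only for exact phrases."""
    words = query.lower().split()
    if exact_phrase:
        return words
    return [w for w in words if w not in STOP_WORDS]

def weight(word):
    """Rarer words get a larger weight: the negative log of their frequency."""
    return -math.log(WORD_FREQUENCY.get(word, DEFAULT_FREQUENCY))
```

With these numbers weight("rude") is larger than weight("all"), and query_terms("rude and all") keeps only "rude" and "all".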
Standard indexing process
• <title> is an important element to look at
• <meta name="keywords" …>
• <meta name="description" …> (is not used by
Google)
• Headlines
– <h1>
– <h2>
–…
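To see what an indexer could pull out of a page, the elements listed above can be collected with Python's standard html.parser module. This is a minimal sketch: the class name IndexableParts and the sample page are invented for illustration, and real indexers look at far more than these elements.

```python
from html.parser import HTMLParser

class IndexableParts(HTMLParser):
    """Collect the title, selected meta tags and h1-h6 headlines of a page."""

    HEADLINES = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}         # e.g. {"keywords": "...", "description": "..."}
        self.headlines = []    # list of (tag, text) pairs in page order
        self._current = None   # tag whose text is being collected, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") in ("keywords", "description"):
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "title" or tag in self.HEADLINES:
            self._current = tag
            self._text = []

    def handle_data(self, data):
        if self._current:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._text).strip()
            if tag == "title":
                self.title = text
            else:
                self.headlines.append((tag, text))
            self._current = None

# Hypothetical sample page.
SAMPLE = """<html><head><title>Web Applications I</title>
<meta name="keywords" content="search, crawler">
<meta name="description" content="Course slides">
</head><body><h1>Publicising</h1><h2>Adding a site</h2></body></html>"""

parser = IndexableParts()
parser.feed(SAMPLE)
```

After feed, parser.title, parser.meta and parser.headlines hold the text an engine would weigh more heavily than ordinary body text.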
Preventing discovery
• robots.txt
– http://www.robotstxt.org/wc/norobots.html
• Security restrictions
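A well-behaved crawler checks robots.txt before fetching a page. The sketch below uses Python's standard urllib.robotparser to show the effect of two Disallow rules; the robots.txt content, the paths, the host example.com and the agent name ExampleBot are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

def allowed(path, agent="ExampleBot"):
    """Return True if the given crawler may fetch the path on example.com."""
    return robots.can_fetch(agent, "http://example.com" + path)
```

Here allowed("/index.html") is True and allowed("/private/data.html") is False. Keep in mind that robots.txt is only a request to crawlers; content that must stay hidden needs real security restrictions.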
Improving the site rank
by Google
• Have other relevant sites link to yours.
– Make sure all the sites that should know about your pages are aware your site is
online.
– Submit your site to relevant directories such as the Open Directory Project and
Yahoo!, as well as to other industry-specific expert sites
• Make a site with a clear hierarchy and text links. Every page should be
reachable from at least one static text link.
• Offer a site map to your users with links that point to the important parts of your
site.
• Create a useful, information-rich site, and write pages that clearly and
accurately describe your content.
• Think about the words users would type to find your pages, and make sure that
your site actually includes those words within it.
• Try to use text instead of images to display important names, content, or links.
• Check for correct HTML, formatting etc.
• If you decide to use dynamic pages (i.e., the URL contains a "?" character), be
aware that not every search engine spider crawls dynamic pages as well as
static pages. Keep the links on a given page to a reasonable number (fewer
than 100).
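The last guideline is easy to check automatically. The following sketch counts the links on a page and lists those whose URL has a query part (a "?" followed by parameters). The names LinkCollector, audit_links and MAX_LINKS are hypothetical, and the threshold of 100 is taken from the guideline above.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

MAX_LINKS = 100  # the guideline asks for fewer than 100 links per page

class LinkCollector(HTMLParser):
    """Collect the href of every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def audit_links(html):
    """Return (link_count, dynamic_links, too_many) for one page."""
    collector = LinkCollector()
    collector.feed(html)
    dynamic = [link for link in collector.links if urlsplit(link).query]
    return len(collector.links), dynamic, len(collector.links) >= MAX_LINKS
```

For example, a page with links to about.html and item.php?id=3 gives a count of 2, one dynamic link (item.php?id=3) and too_many equal to False.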
Improving the site rank
• Make use of the robots.txt file on your web server. This file tells
crawlers which directories can or cannot be crawled. Visit
http://www.robotstxt.org/wc/faq.html to learn how to instruct robots
when they visit your site.
– Use Google Sitemaps
• Don't use "&id=" as a parameter in your URLs
• Provide high-quality content on your pages, especially your
homepage. This is the single most important thing to do. If your
pages contain useful information, their content will attract many
visitors and entice webmasters to link to your site
– Links tell search engines how important your site is and what it is about
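A sitemap is an XML file that lists the URLs you want crawled. The sketch below builds one in the sitemaps.org 0.9 format with Python's standard xml.etree.ElementTree; whether this matches the exact format Google Sitemaps expected at the time of the slides is an assumption, and the function name build_sitemap and the URLs are made up.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical site URLs.
SITEMAP = build_sitemap([
    "http://example.com/",
    "http://example.com/courses/idk0040.html",
])
```

The resulting string is typically saved as sitemap.xml in the site root and then submitted to the search engine.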
Improving the site rank: be honest
basic principles
• Make pages for users, not for search engines. Don't deceive your users or present
different content to search engines than you display to users, which is commonly referred
to as "cloaking."
• Avoid tricks intended to improve search engine rankings. A good rule of thumb is whether
you'd feel comfortable explaining what you've done to a website that competes with you.
• Don't participate in link schemes designed to increase your site's ranking or PageRank. In
particular, avoid links to web spammers or "bad neighborhoods" on the web, as your own
ranking may be affected adversely by those links.
• Don't use unauthorized computer programs to submit pages, check rankings, etc.
Some specific guidelines
• Avoid hidden text or hidden links.
• Don't send automated queries to Google.
• Don't load pages with irrelevant words.
• Don't create multiple pages, subdomains, or domains with substantially duplicate content.
• Don't create pages that install viruses, trojans, or other badware.
• If your site participates in an affiliate program, make sure that your site adds value.
Provide unique and relevant content that gives users a reason to visit your site first.
Final notes
• Keep the structure of the site static
– It takes considerable time for search engines to
re-visit your site, especially if it is not top-ranked
– Links from other sites that refer to your pages could
otherwise produce the “Page not found” error