BioiSdfjldkfg - Cardiff University

Basic Computing Concepts
for Bioinformatics
Dr Richard White
Basic computing concepts
• “Basic Computing concepts” sounds a bit scary.
I hope you’ll find it isn’t really.
• Actually some of the “Basic Computing
Concepts” you’ll be familiar with already.
2
What does the average biologist
use computers for?
• Browsing the Web, searching with Google, etc.
• Email
• Word-processing for reports, etc. (e.g. MS Word)
• Data handling and simple statistics (e.g. Excel)
• Playing CDs, games, etc. (no, of course not, only joking…)
• So you’re probably quite experienced in computer
use already
• Databases? The use of bioinformatics databases
figures prominently in this course.
3
What should biologists use
computers for?
• Access to biological databases, especially
those containing bioinformatics information
• Visualisation: ways to understand data better
by visual exploration
• Analysing data, especially to test hypotheses
(to understand biology better)
4
Computer use during this course
• Mostly we’ll be concerned with access to
biological databases
• Also some visualisation sessions and maybe
some data analysis and hypothesis testing
5
Using predefined tools
• You’ll be doing a lot of this work using tools
available on the web.
– This makes life easy, because the hard work of
setting these tools up for use has already been
done by someone else.
• However, sometimes it’s useful to get your
hands dirty and mess about with the data and
ways to process it yourself,
– especially if you want to do something that zillions
of other people haven’t already thought of.
6
Use of databases
• I’ll be running a session on the use of
databases in week 4, but at the moment I want
to think about this in order to discover some
Basic Computing Concepts.
• First, let’s consider the characteristics of
databases for a moment.
7
Simple database concepts
Computers allow the analysis of large data sets.
These are frequently arranged as two-dimensional data tables, based on the
convention that
– each row holds information on a separate object
(or abstract entity such as a species),
– each column holds information on a particular
property or characteristic of the objects,
– in general there will be a single value in each cell
of the table, representing the value of a specific
characteristic for one particular object.
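As a rough illustration of the row and column convention, here is a minimal Perl sketch of such a table. The variable name @table and the column names are made up for this example; the chromosome counts are the usual diploid numbers for these species.

```perl
use strict;      # optional safety checks, not used elsewhere in these slides
use warnings;

# Each row describes one object (here a species); each key is a column.
my @table = (
    { name => 'Homo sapiens',            common => 'human',     chromosomes => 46 },
    { name => 'Mus musculus',            common => 'mouse',     chromosomes => 40 },
    { name => 'Drosophila melanogaster', common => 'fruit fly', chromosomes => 8  },
);

# One cell holds one value of one characteristic for one object.
print "Second row, column 'common': $table[1]{common}\n";

# Walk down the table, one row at a time.
foreach my $row (@table) {
    print "$row->{name}: $row->{chromosomes} chromosomes\n";
}
```

Rows are numbered from 0 in Perl, so $table[1] is the second row.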
8
Spreadsheets
• Data in the form of two-dimensional tables is
frequently analysed using computer
spreadsheet programs such as Microsoft Excel,
especially where the purpose is
– relatively simple data reorganisation,
– summarisation,
– statistical testing,
– report generation.
9
Databases
• It is becoming harder to distinguish between
spreadsheet and database programs.
• Most databases require more than one table: for
example, one table may store data about proteins and
another table stores data about the species these
proteins are found in.
• For more about database systems, see the PowerPoint
presentation (DatabaseIntroduction.ppt) on
my web site (see handout for details).
10
Methods for using databases
• What methods exist to use databases?
• Basically there are several approaches to the
use of databases:
11
Database use 1: direct access to
database tables
• Run your own database on your own computer
(e.g. MS Access)
• Use a program on your PC which gives you
direct access to the tables in the remote
database (client-server database access)
In both cases, you need to know what the tables are and what they
contain, and you need a language, such as SQL, in which to give
instructions to the database.
12
SQL statements
• SQL (“Structured Query Language”) is a language for
specifying the creation of databases and the updating and
retrieval of information in them. It is general and “portable” –
so that it can be used with a variety of different database
systems without having to learn a new language for each one.
• The language goes far beyond the scope of this course.
Briefly, it can be used to:
– Specify the tables in the database and the fields (columns) they contain
– Make additions and updates to the data in those tables
– Retrieve information from one or more of the tables
13
SQL for data retrieval
• A typical SQL statement for data retrieval would look
something like this:
SELECT <some fields> FROM <table> WHERE <condition>;
• The condition effectively selects certain rows from the table.
• Thus the result is often a smaller table than the one being
queried.
• Tables can be “joined” together to combine information from
more than one table, for example when extracting a molecular
sequence from one table and the bibliographic details of the
reference to where it was published from another table.
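To get a feel for what the WHERE condition and a join do, the same steps can be imitated in plain Perl. This is only a sketch, not how a database works internally. The tables @sequences and %refs, their column names and all the values in them are invented for illustration.

```perl
use strict;      # optional safety checks, not used elsewhere in these slides
use warnings;

# Hypothetical "sequences" table: one hash per row.
my @sequences = (
    { id => 'S1', organism => 'human', length => 1200, ref_id => 'R1' },
    { id => 'S2', organism => 'mouse', length => 800,  ref_id => 'R2' },
    { id => 'S3', organism => 'human', length => 450,  ref_id => 'R1' },
);

# Hypothetical "refs" table, keyed by ref_id for easy lookup.
my %refs = (
    R1 => { journal => 'Journal A', year => 2001 },
    R2 => { journal => 'Journal B', year => 2003 },
);

# Like: SELECT id, length FROM sequences WHERE organism = 'human';
# The condition keeps only some of the rows.
my @human = grep { $_->{organism} eq 'human' } @sequences;
foreach my $row (@human) {
    print "$row->{id}\t$row->{length}\n";
}

# Like a join: look up the matching row in the other table.
foreach my $row (@human) {
    my $ref = $refs{ $row->{ref_id} };
    print "$row->{id} was published in $ref->{journal} ($ref->{year})\n";
}
```

The result @human has two rows where the original table had three, which is the "smaller table" effect described above.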
14
Database use 2: predefined
operations
Alternatively, you might have forms and queries already set up
for you, which you can just run in order to perform predefined
kinds of searches. These predefined operations can be made
directly available to you by:
• Browsing a web page, typically containing a form, which
gives you access (no pun intended) to a database somewhere else. You’ve
done this if you’ve ever bought anything on the Internet.
• Using or even writing a small program (sometimes called a
script to make it seem less scary) to fetch the data for you.
This allows you to process the data in useful ways:
– to search for features you’re interested in,
– to summarise the data in the way you want, or
– to extract data for statistical analysis to test hypotheses.
15
Database use 3: using predefined
operations
The predefined operations may be packaged as CGI programs or
Web Services or in a variety of other ways, but basically you
just send a request to the service, optionally with some
‘parameters’ to specify what you want, and wait for the reply.
The reply usually comes back either
• in HTML (as a web page containing the data requested) or
• as some other sort of file to be downloaded (i.e. stored on your
PC), either
– in one of a number of formats invented by the data providers, or
– in XML, a standard but flexible (and verbose) way to structure a data
file, so that other programs (rather than humans) can process it easily.
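A request with parameters is often just a web address with name=value pairs added after a question mark. The Perl sketch below only builds such an address and does not send it. The service address, the parameter names and the subroutine names url_encode and build_request are hypothetical, chosen for illustration.

```perl
use strict;      # optional safety checks, not used elsewhere in these slides
use warnings;

# Replace characters that are not safe inside a URL with %XX codes.
# (Adequate for plain ASCII text, which is all this sketch assumes.)
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.~-])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Join a service address and its parameters into one request URL.
sub build_request {
    my ($service, %params) = @_;
    my @pairs;
    foreach my $name (sort keys %params) {
        push(@pairs, url_encode($name) . '=' . url_encode($params{$name}));
    }
    return $service . '?' . join('&', @pairs);
}

# Hypothetical service address, used for illustration only.
my $url = build_request('http://www.example.org/cgi-bin/search',
                        db => 'protein', term => 'insulin human');
print "$url\n";
```

Sending the request and reading the reply needs an extra module; LWP::Simple, with its get function, is one commonly used option if it is installed on your system.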
16
Overview of NCBI Entrez
In a later session, you’ll be introduced to a number of
bioinformatics databases, but it’s worth spending a
moment looking at a popular way to make use of
some of them, because you will explore this in
Practical 2 in week 4 of this course.
• NCBI web site
• Entrez utilities
17
Brief introduction to Perl programming
(What? In ten minutes??)
This will help you prepare for Practical 2 (the practical part of the
4th week of the course), in which we shall use simple Perl
programs to request data from a bioinformatics information
provider such as NCBI, by connecting with their Entrez
utilities. (Additional Perl tutorial material may be made
available.)
• What is a Perl program? (or “script”)
• How to run one
• How to write one
• What do you need? – See the handout
18
A computer program
A program is a set of instructions to the computer, such as
• Get input from user
• Perform calculation
• Display window
• React to mouse click
These are instructions at a very high level. They need to be
broken down into smaller details. A program consists of
combinations of:
• Sequences of instructions (statements)
• Repetitions (to execute statements repeatedly)
• Selections (to choose which statements to execute)
• Functions (subroutines or methods: groups of instructions)
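The four building blocks can all be seen in one very small program. The following Perl sketch counts the G and C bases in a short DNA string. The subroutine name gc_count and the example sequence are made up for illustration.

```perl
use strict;      # optional safety checks, not used elsewhere in these slides
use warnings;

# Function: a named group of instructions that can be reused.
sub gc_count {
    my ($dna) = @_;
    my $count = 0;
    # Repetition: look at each base in turn.
    foreach my $base (split //, $dna) {
        # Selection: only count G and C.
        if ($base eq 'G' or $base eq 'C') {
            $count = $count + 1;
        }
    }
    return $count;
}

# Sequence: these statements run one after another.
my $dna = 'ATGCGC';
my $gc  = gc_count($dna);
print "$dna contains $gc G or C bases\n";
```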
19
A simple program
• Here is a simple Perl program.
#!/usr/local/bin/perl
# Program to do the obvious
print 'Hello world.';
• The first line: Perl programs conventionally start with this as the
very first line, although the path in it may vary from system to system,
and on some systems it is not needed at all. It tells the machine what to
do with the file when it is executed (it tells it to run the file through
the Perl software to execute it).
• Everything which is not a comment is a Perl statement which
must end with a semicolon, like the last line above.
• So the next thing to do is to run it.
20
Running the program
• Type in the example program using a text editor, and
save it in a file called something.pl.
• Now to run the program just type the following at the
Command Prompt.
perl something.pl
• If something goes wrong then you may get error
messages, or you may get nothing at all.
21
Perl programming concepts: variables
Variables can hold both strings and numbers. For
example, the statement
$priority = 9;
sets the scalar variable $priority to 9, but you can
also assign a string to exactly the same variable:
$priority = 'high';
• In general, variable names consist of letters, numbers
and underscores, but they should not start with a
number. Perl is case sensitive, so $a and $A are
different variables.
22
Operations and Assignment
Perl uses all the usual arithmetic operators:
$a = 1 + 2;   # Add 1 and 2 and store in $a
$a = 3 - 4;   # Subtract 4 from 3 and store in $a
$a = 5 * 6;   # Multiply 5 and 6
$a = 7 / 8;   # Divide 7 by 8 to give 0.875
etc.
and for strings Perl has the following among others:
$a = $b . $c; # Concatenate $b and $c
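A few more operators, as a short sketch. The variable names are made up for illustration, and ** (power) and % (remainder) are two standard Perl operators not shown on the slide.

```perl
use strict;      # optional safety checks, not used elsewhere in these slides
use warnings;

my $total    = 1 + 2;      # 3
my $diff     = 3 - 4;      # -1
my $product  = 5 * 6;      # 30
my $quotient = 7 / 8;      # 0.875
my $power    = 2 ** 10;    # 2 to the power 10, i.e. 1024
my $rest     = 17 % 5;     # remainder after dividing 17 by 5, i.e. 2

# Concatenation joins strings end to end; add the space yourself.
my $genus   = 'Homo';
my $species = 'sapiens';
my $name    = $genus . ' ' . $species;
print "$name\n";
```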
23
Array variables
A slightly more interesting kind of variable is the array variable
which is a list of scalars (single values, i.e. numbers and
strings). Array variables have the same format as scalar
variables except that they are prefixed by an @ symbol. The
statement
@food = ("apples", "pears", "eels");
assigns a three element list to the array variable @food.
The array is accessed by using indices starting from 0, and square
brackets are used to specify the index. The expression
$food[2]
returns eels. Notice that the @ has changed to a $ because
$food[2] and eels are scalars, not arrays.
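A few other common things to do with an array, as a minimal sketch. The functions scalar and push are standard Perl; the extra food items are made up.

```perl
use strict;      # optional safety checks, not used elsewhere in these slides
use warnings;

my @food = ("apples", "pears", "eels");

print "$food[0]\n";                 # first element (index 0): apples
print scalar(@food), " items\n";    # number of elements: 3

push(@food, "eggs");                # add an element at the end
$food[1] = "plums";                 # replace the second element

# Visit every element in turn.
foreach my $item (@food) {
    print "I like $item\n";
}
```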
24
File handling
Here is a basic Perl program which does the same as the UNIX
cat or DOS/Windows type command on a certain file.
#!/usr/local/bin/perl
# Program to open the password file, read it in,
# print it, and close it again.
$file = '/etc/passwd';   # Name the file
open(INFO, $file);       # Open the file
@lines = <INFO>;         # Read it into an array
close(INFO);             # Close the file
print @lines;            # Print the array
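The program above carries on silently if the file cannot be opened. A slightly more defensive variant might look like the sketch below. It is one possible style, not the only one: it uses the three-argument form of open, keeps the filehandle in an ordinary variable ($info), and stops with a message if the open fails. The subroutine name read_lines is made up for illustration.

```perl
use strict;      # optional safety checks, not used elsewhere in these slides
use warnings;

# Read all the lines of a file into a list, or stop with a message.
sub read_lines {
    my ($file) = @_;
    open(my $info, '<', $file) or die "Cannot open $file: $!\n";
    my @lines = <$info>;     # Read the whole file into an array
    close($info);
    return @lines;
}

my $file = '/etc/passwd';    # Same file as in the program above
if (-r $file) {
    my @lines = read_lines($file);
    print "Read ", scalar(@lines), " lines from $file\n";
} else {
    print "$file is not readable on this system\n";
}
```

The special variable $! holds the system's reason for the failure, for example a missing file or a lack of permission.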
25
Control structures
Perl supports lots of different kinds of control structures.
Have a look at the Perl resources listed on the
handout. Most Perl programs use these features.
• Programs can choose between alternative
branches
• Programs can repeat statements until something
happens
• Frequently used statements to carry out some
common task can be made into a “subroutine” or
“function” and called from other parts of the program
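A minimal sketch of these three ideas in Perl, using if/elsif/else for branches, until for repetition and sub for a subroutine. The subroutine name is_long, the length thresholds and the numbers are all made up for illustration.

```perl
use strict;      # optional safety checks, not used elsewhere in these slides
use warnings;

# Subroutine: a reusable test that can be called from several places.
sub is_long {
    my ($length) = @_;
    return $length > 1000;
}

my @lengths = (450, 1200, 800, 3000);

# Choosing between alternative branches.
foreach my $length (@lengths) {
    if (is_long($length)) {
        print "$length: long\n";
    } elsif ($length > 500) {
        print "$length: medium\n";
    } else {
        print "$length: short\n";
    }
}

# Repeating statements until something happens:
# keep halving until the value is 1 or less.
my $value = 64;
my $steps = 0;
until ($value <= 1) {
    $value = $value / 2;
    $steps = $steps + 1;
}
print "Halved $steps times\n";
```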
26
End
27