CSC 4630
Meeting 9
February 14, 2007
Valentine’s Day; Snow Day
Last of awk
• Quick review of scripting languages, and
more generally, programming languages
– Built-in variables
– Variable typing
– Implicit control structure of program
– Assignment statements and operations
– Control structures
Next Week and Next Next Week
• Exam 1: Monday, February 26
• Project 2: Wednesday, February 28
Last of awk (2)
• Control structures
• Arrays
• Formatted printing
• Subtleties and intricacies
Control Structures
• if (<expression>) <s1> else <s2>
<expression> can be any expression; it counts as true
when its value is non-zero (number) or non-null (string)
<s1> and <s2> can each be a single statement or a group of
statements enclosed in braces; the else part is optional
Note the critical parentheses that separate
the conditional expression from <s1>
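A minimal sketch of the truth rule above, wrapped in a shell function so it can be run directly. The function name classify and the sample input are made up for illustration; they are not part of the course materials.

```sh
# classify: label each input line as true or false under the awk rule.
classify() {
  awk '{
    # $1 is tested directly: a numeric field is true when non-zero,
    # a string field is true when non-empty.
    if ($1) {
      verdict = "true"
    } else {
      verdict = "false"
    }
    print "[" $1 "] is " verdict
  }'
}

# Sample input: a non-zero number, zero, a string, and an empty line.
printf '5\n0\nabc\n\n' | classify
```

With this input the sketch reports 5 and abc as true, and 0 and the empty line as false.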
Control Structures (2)
• while (<expression>) <s1>
Same rules for <expression> and <s1> as for if-else
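A small sketch of a while loop whose body is a braced group of statements. The function name halvings and the task (counting how often a number can be halved before it reaches 1) are illustrative assumptions only.

```sh
# halvings: for each number read, count the integer halvings needed to reach 1.
halvings() {
  awk '{
    n = $1
    steps = 0
    while (n > 1) {        # the body repeats as long as the condition is true
      n = int(n / 2)
      steps++
    }
    print $1, steps
  }'
}

printf '1\n8\n100\n' | halvings
```

For the sample input this prints 1 0, 8 3, and 100 6, one pair per line.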
Control Structures (3)
• for (<e1>;<e2>;<e3>) <s1> is equivalent to
<e1>; while (<e2>) {<s1>;<e3>}
• <e1> initializes the loop variable
• <e2> checks the loop variable for termination
• <e3> changes the value of the loop variable
• for (k in <array>) <s1> loops over the subscripts
of an array, but the order in which the subscripts are
visited is unspecified (implementation-dependent).
Careful: awk arrays are associative, so any string
can be used as a subscript.
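Both loop forms can be sketched together, assuming a simple word-counting task. The function name count_words and the sample input are hypothetical. Because for-in makes no promise about order, the output is piped through sort to make it repeatable.

```sh
# count_words: count how often each whitespace-separated word occurs.
count_words() {
  awk '{
    # Three-part for: walk the fields of the current record.
    for (i = 1; i <= NF; i++)
      count[$i]++            # the subscript is the word itself, a string
  }
  END {
    # for-in visits every subscript once, in no guaranteed order.
    for (w in count)
      print w, count[w]
  }' | sort                  # sort afterwards to get a stable order
}

printf 'a b a\nc a b\n' | count_words
```

The sample prints a 3, b 2, and c 1.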
Control Structures (4)
“Go to” structures
• break, when executed within a for or while
statement, causes an immediate exit from that loop
• continue, when executed within a for or while
statement, skips the rest of the loop body and
starts the next iteration
• next stops processing the current line (record),
reads the next line of the input file, and restarts the
sequence of pattern {action} statements on it
• exit skips the remaining input, causes the program
to jump to the END pattern, execute it, and stop
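The four statements can be seen side by side in one short sketch. The input conventions used here (lines starting with # are comments, the word stop ends a line early, the word quit ends the input) are invented for this example, as is the function name sum_until_quit.

```sh
# sum_until_quit: add up the integer fields of the input, with early exits.
sum_until_quit() {
  awk '
    /^#/         { next }            # comment line: later rules never see it
    $1 == "quit" { exit }            # stop reading input; control passes to END
    {
      for (i = 1; i <= NF; i++) {
        if ($i == "stop") break            # leave the loop, ignore the rest of the line
        if ($i !~ /^[0-9]+$/) continue     # not an integer: go on to the next field
        total += $i
      }
    }
    END { print "total", total }
  '
}

printf '# header\n1 2 x 3\n4 stop 100\nquit\n50\n' | sum_until_quit
```

Here the comment line is skipped, x is passed over, 100 is never reached, and the final 50 is never read, so the sketch prints total 10.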
Practice Time
• We’ll use pair programming
– Work in pairs
– One person is in control of the keyboard
– Sketch the features of the program
– Test as you go
awk Practice: Example 1
Input: A file containing syntactically correct
North American telephone numbers in the
form XXX-XXX-XXXX
Output: A file containing the numbers from
the input file formatted as international
numbers, namely +1.XXX.XXX.XXXX
Test file: Create your own
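One possible approach to Example 1, offered as a sketch rather than the expected solution. It relies on the stated guarantee that every input line is already a well-formed XXX-XXX-XXXX number. The function name to_international and the sample numbers are made up.

```sh
# to_international: rewrite XXX-XXX-XXXX as +1.XXX.XXX.XXXX.
to_international() {
  awk '{
    gsub(/-/, ".")        # XXX-XXX-XXXX becomes XXX.XXX.XXXX
    print "+1." $0        # prepend the North American country code
  }'
}

printf '865-555-0100\n215-555-0199\n' | to_international
```

The sample prints +1.865.555.0100 and +1.215.555.0199.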
awk Practice: Example 2
Input: A file, each line of which supposedly
contains a North American style telephone
number
Output: The input file cleaned of bad numbers,
inappropriate lines, and empty lines. Each
correct number is formatted as XXX-XXX-XXXX
Test Input: /mnt/a/beck/samples/phonenumbers
Notes: The program must handle arbitrary input files.
Start simple and add features as you
investigate.
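A starting-point sketch for Example 2. The slide does not define what makes a number bad, so the rules below are assumptions: a line may contain only digits, spaces, and the characters ( ) . + -, and it must hold exactly ten digits after an optional leading country code 1. The function name clean_numbers and the sample lines are hypothetical, and the real sample file may need more rules.

```sh
# clean_numbers: keep plausible 10-digit numbers, print them as XXX-XXX-XXXX.
clean_numbers() {
  awk '{
    if ($0 ~ /[^0-9 ().+-]/) next          # letters or other stray characters
    digits = $0
    gsub(/[^0-9]/, "", digits)             # keep only the digits
    if (length(digits) == 11 && substr(digits, 1, 1) == "1")
      digits = substr(digits, 2)           # drop a leading country code
    if (length(digits) != 10) next         # wrong digit count (covers empty lines)
    print substr(digits, 1, 3) "-" substr(digits, 4, 3) "-" substr(digits, 7, 4)
  }'
}

printf '%s\n' '(865) 555-0100' '865.555.0101' '' 'call home' \
  '555-0102' '1 865 555 0103' '86555501040' | clean_numbers
```

Of the seven sample lines, three survive: 865-555-0100, 865-555-0101, and 865-555-0103.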
awk Practice: Example 3
Input: A file in the same form as for
Example 2.
Output: The input file cleaned, with each correct
number formatted in international format,
+1.XXX.XXX.XXXX
awk Practice: Example 4
The website flightaware.com gives the departure
and arrival history of commercial airline flights,
among other things. You can easily extract the
history to a text file by cutting and pasting. But
then the file needs to be cleaned and
reformatted to be useful.
Input: A flight history file from flightaware.com, e.g.
/mnt/a/beck/samples/flight1931
Example 4 (2)
Output: Data from the input file involving
one leg of the flight (use PHL to ATL), one
line per day, fields separated by :: . Fields
are date, departure time, arrival time,
elapsed time. Include a header line that
contains the flight number (1931 for the
sample), origin (PHL), and destination
(ATL). Include a second header line that
labels the data columns.
awk Practice: Example 5
Computations involving flight data.
Input: Cleaned flight data file (the output file
from Example 4)
Output: Earliest and latest departure,
earliest and latest arrival, shortest elapsed
time, longest elapsed time, average
elapsed time.
Notes: Programs from Examples 4 and 5
should work with any set of flight data.
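A sketch of the Example 5 computations under several assumptions that the slides do not fix: the cleaned file has exactly two header lines, departure and arrival times are 24-hour HH:MM, elapsed time is H:MM, and no flight crosses midnight. The function names and the three data rows are invented for illustration and are not real flight data.

```sh
# flight_stats: summarize a cleaned flight file with ::-separated fields
# (date, departure, arrival, elapsed), assuming HH:MM times.
flight_stats() {
  awk '
    # Convert H:MM or HH:MM to minutes; part is a local array.
    function minutes(t,    part) { split(t, part, ":"); return part[1] * 60 + part[2] }
    # Convert minutes back to H:MM.
    function clock(m) { return sprintf("%d:%02d", int(m / 60), m % 60) }

    BEGIN   { FS = "::" }
    NR <= 2 { next }                 # skip the two header lines
    NF < 4  { next }                 # skip incomplete rows
    {
      dep = minutes($2); arr = minutes($3); dur = minutes($4)
      if (n == 0 || dep < dep_min) dep_min = dep
      if (n == 0 || dep > dep_max) dep_max = dep
      if (n == 0 || arr < arr_min) arr_min = arr
      if (n == 0 || arr > arr_max) arr_max = arr
      if (n == 0 || dur < dur_min) dur_min = dur
      if (n == 0 || dur > dur_max) dur_max = dur
      total += dur
      n++
    }
    END {
      if (n == 0) exit               # no data rows, nothing to report
      print "departure: earliest " clock(dep_min) ", latest " clock(dep_max)
      print "arrival: earliest " clock(arr_min) ", latest " clock(arr_max)
      print "elapsed: shortest " clock(dur_min) ", longest " clock(dur_max) \
            ", average " clock(int(total / n + 0.5))
    }
  '
}

printf '%s\n' 'Flight 1931::PHL::ATL' 'date::departure::arrival::elapsed' \
  '2007-02-10::08:15::10:20::2:05' '2007-02-11::08:40::10:35::1:55' \
  '2007-02-12::08:05::10:25::2:20' | flight_stats
```

Converting every time to minutes first keeps the comparisons and the average purely numeric; the average is rounded to the nearest minute.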
awk Practice: Example 6
DNA to protein translation
– In the computational biology world it is well known
that each triple of bases (a codon) along a DNA
segment translates to one of the 20 amino
acids, which are the building blocks for
proteins, or to a stop signal.
Input: A DNA sequence
Output: The corresponding amino acid
sequence
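A sketch of one way to approach Example 6, assuming the standard genetic code, one sequence per input line read from its first base, one-letter amino acid codes on output, * for a stop codon, and ? for a triple containing anything other than T, C, A, or G. The lookup table is built from the usual 64-character listing of the code in TCAG order. The function name translate is hypothetical.

```sh
# translate: turn each DNA line into a one-letter amino acid string.
translate() {
  awk '
    BEGIN {
      # Standard genetic code, codons enumerated in T, C, A, G order.
      bases = "TCAG"
      amino = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
      n = 0
      for (i = 1; i <= 4; i++)
        for (j = 1; j <= 4; j++)
          for (k = 1; k <= 4; k++) {
            codon = substr(bases, i, 1) substr(bases, j, 1) substr(bases, k, 1)
            code[codon] = substr(amino, ++n, 1)
          }
    }
    {
      seq = toupper($0)
      protein = ""
      # Step through complete triples; a trailing partial codon is ignored.
      for (p = 1; p + 2 <= length(seq); p += 3) {
        c = substr(seq, p, 3)
        protein = protein ((c in code) ? code[c] : "?")
      }
      print protein
    }
  '
}

printf 'ATGGCCTAA\ntttggg\nATGA\n' | translate
```

The three sample lines translate to MA*, FG, and M.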
Project 2
• Due Wednesday, February 28
• Part 1
– Implement an improved version of mobylex
entirely in awk. The program should take a
file containing a chapter of the text and return
the lexicon with frequency counts sorted in
decreasing order of frequency.
Project 2 (2)
– Notes on Part 1
• Include one title line giving chapter number and
title
• All trailing punctuation should be removed
• All initial capitalization should be removed
• No numbers in lexicon
• Compound words should be retained
– Desirable features
• Remove contractions and spell them out
• Remove possessive constructions. The ’s should
not be counted as a separate word.
• Retain capitalized proper names
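The core of Part 1 (counting words and listing them by decreasing frequency) might be sketched as follows. This is a simplified illustration, not a full solution: it lowercases everything, strips punctuation only at the ends of a token, drops tokens containing digits, and skips the title line and all of the desirable features. It hands the ordering to the external sort command through an awk pipe, which may or may not satisfy the "entirely in awk" requirement. The function name lexicon and the sample text are made up.

```sh
# lexicon: word frequencies, most frequent first (ties in alphabetical order).
lexicon() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        w = tolower($i)
        sub(/^[^a-z0-9]+/, "", w)           # strip leading punctuation
        sub(/[^a-z0-9]+$/, "", w)           # strip trailing punctuation
        if (w == "" || w ~ /[0-9]/) continue   # nothing left, or a number
        freq[w]++                           # interior hyphens survive, so
      }                                     # compound words are retained
    }
    END {
      for (w in freq)
        print w, freq[w] | "sort -k2,2nr -k1,1"
    }
  '
}

printf 'The whale, the Whale! 42 sea-captain\nthe end.\n' | lexicon
```

For the sample text this lists the 3, whale 2, end 1, and sea-captain 1.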
Project 2 (3)
• Part 2
– Add summary statistics to the mobylex
program that give
• Total number of words in chapter
• Number of different words in chapter
• Average word length (number of characters) (taken
over distinct words)
• Maximum word length
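The four statistics can be gathered in a single pass and reported with printf. The sketch below assumes a simplified notion of a word (lowercased, punctuation stripped from both ends, tokens containing digits skipped); the function name word_stats and the sample text are hypothetical.

```sh
# word_stats: total words, distinct words, average and maximum word length.
word_stats() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        w = tolower($i)
        sub(/^[^a-z0-9]+/, "", w)           # strip leading punctuation
        sub(/[^a-z0-9]+$/, "", w)           # strip trailing punctuation
        if (w == "" || w ~ /[0-9]/) continue
        total++
        if (!(w in seen)) {                 # first time this word appears
          seen[w] = 1
          distinct++
          chars += length(w)                # lengths summed over distinct words
          if (length(w) > longest) longest = length(w)
        }
      }
    }
    END {
      printf "total words: %d\n", total
      printf "distinct words: %d\n", distinct
      printf "average length: %.2f\n", (distinct ? chars / distinct : 0)
      printf "maximum length: %d\n", longest
    }
  '
}

printf 'The whale, the Whale!\nthe end.\n' | word_stats
```

The sample has 6 words, 3 of them distinct (the, whale, end), with an average length of 3.67 and a maximum length of 5.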