Text Processing with Perl
Aim: To learn how to perform some common text processing tasks using Perl.
Introduction:
Perl is a high-level, general-purpose, interpreted, dynamic programming language. Perl
was originally developed by Larry Wall in 1987 as a general-purpose Unix scripting language to
make report processing easier.
Perl borrows features from other programming languages including C, shell scripting
(sh), AWK, and sed. The language provides powerful text processing facilities without the
arbitrary data length limits of many contemporary Unix tools, facilitating easy manipulation of
text files.
Though originally developed for text manipulation, Perl is used for a wide range of tasks
including system administration, web development, network programming, games,
bioinformatics, and GUI development.
The language is intended to be practical (easy to use, efficient, complete) rather than
beautiful (tiny, elegant, minimal). Its major features include support for multiple programming
paradigms (procedural, object-oriented, and functional styles), reference counting memory
management, built-in support for text processing, and a large collection of third-party modules.
CPAN, the Comprehensive Perl Archive Network, is an archive of over 90,000
modules of software written in Perl, as well as documentation for it. It has a presence on the
World Wide Web at www.cpan.org and is mirrored worldwide at more than 200 locations. CPAN
can denote either the archive network itself, or the Perl program that acts as an interface to the
network and as an automated software installer (somewhat like a package manager). Most
software on CPAN is free and open source software.
This exercise consists of 7 programs of increasing complexity in Perl.
Description:
Students will write seven programs in Perl and test their results. Some of the programs
also use third-party modules. These modules will be installed from the
distribution packages rather than through CPAN.
Pre-requisites:
Perl is installed by default in most Linux distributions, so students can start
programming with any text editor of their choice. When a program requires a third-party module
or support files, this is mentioned along with instructions on how to install them.
The Programs:
The seven programs to be done in this exercise are:
1. Hello World
2. Greeting the user
3. Analysing text from a file and printing some statistics
4. Proper command line processing and analysing a text file to get word frequency,
word size frequency and the type-token ratio.
5. Text analysis and outputting the result to another text file with proper formatting.
6. Reading data from a flat file using Perl's database interface and performing SQL
queries on the data.
7. Reading rainfall data from a csv file, doing some computations and producing a graph
based on the results.
Create a new directory for the programs.
> mkdir perl_exercises
> cd perl_exercises
Download the supporting materials zip file to the newly created directory and unzip the contents
to the directory.
1. Hello World
Create a new file using the gedit text editor.
> gedit hello.pl
Use the following code:
#!/usr/bin/env perl
#The above statement tells the system that this is a perl program.
print "Hello World!\n"; #print the text Hello World and a newline.
Save the file.
Now run the program as follows:
> perl hello.pl
Hello World!
>
The above command asks the perl interpreter to load the file called hello.pl and execute it. On
execution, the text "Hello World!" is printed on the screen.
2. Greeting the user
This program asks for the user's name and year of birth. It then greets the user and reports the
user's age.
> gedit name.pl
The Code:
#!/usr/bin/env perl
#
# name.pl
print "Enter your name and press return:";
$name=<STDIN>;      #read the data
chomp($name);       #remove the newline
print "\nEnter your birth year and press return:";
$byear=<STDIN>;
chomp($byear);
#localtime gives the data with 9 distinct values. Collect them.
my ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $dst) =
    localtime time;
#the year starts from 1900 according to localtime
$age=($year + 1900) - $byear;
print "\nHello, $name!\n";
print "You are $age years old.\n";
On execution:
> perl name.pl
Enter your name and press return:Mickey Mouse
Enter your birth year and press return:1928
Hello, Mickey Mouse!
You are 83 years old.
>
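The age arithmetic in name.pl can be isolated into small subroutines, which makes it easier to check. The following is only a sketch: the names current_year and age_in_year are hypothetical and do not appear in the original program. It relies on localtime in list context, where the element at index 5 is the year counted from 1900.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: the current calendar year.
# localtime in list context returns nine values; index 5 is years since 1900.
sub current_year {
    my @t = localtime(time);
    return $t[5] + 1900;
}

# Hypothetical helper: age reached in $year by someone born in $byear.
sub age_in_year {
    my ($byear, $year) = @_;
    return $year - $byear;
}

print "Born in 1928, age in 2011: ", age_in_year(1928, 2011), "\n";
print "Current year: ", current_year(), "\n";
```

With a birth year of 1928 and the year 2011, the first line reports 83, which matches the sample run above.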
3. Analysing text and printing the statistics
This program reads the text file given on the command line, asks the user for a word to search
for in the text, and prints some statistics about the text. Note that if no file name is given
when the program is run, it will appear to hang, because the <> operator then reads from
standard input. Proper handling of command line arguments is explored in the next exercise.
> gedit words.pl
The Code:
#!/usr/bin/env perl
#
#words.pl FILE
#
#if no data filename is given, this program will wait for standard input
print "Enter the word you want to search for and press return:";
$sword=<STDIN>;
chomp($sword);
$scount = 0;    #search counter
$bcount = 0;    #blank line counter
while(<>)       #continue reading as long as there is input
{
    chomp;      #remove newline from each line
    foreach $w (split)      #split each line into words
    {
        if ($w eq $sword)
        {
            $scount++;      #search hit counter
        }
        $words++;
        $char += length($w);
    }
    #if the length of the current line is 0, we have a blank line
    if (length($_) == 0)
    {
        $bcount++;
    }
}
$avgw = $words/$.;      #average words per line including blank lines
$avgc = $char/$words;   #average characters per word
print "There are $. lines in this file including $bcount blank lines.\n";
print "There are $words words in this file.\n";
print "There are $char characters in this file.\n";
print "The average number of words per line is $avgw.\n";
print "The average number of characters per word is $avgc.\n";
print "the word $sword occurs in the text $scount times.\n";
On execution:
> perl words.pl constitution_preamble.txt
Enter the word you want to search for and press return:the
There are 13 lines in this file including 6 blank lines.
There are 85 words in this file.
There are 470 characters in this file.
The average number of words per line is 6.53846153846154.
The average number of characters per word is 5.52941176470588.
the word the occurs in the text 4 times.
The file constitution_preamble.txt is part of the support file archive which was unzipped at the
beginning.
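The counting logic of words.pl can also be tried on a few lines held in memory, without a data file. The sketch below is an illustration under assumptions: the subroutine text_stats and its hash keys are hypothetical names, and the guard against division by zero is an addition that words.pl does not have.

```perl
use strict;
use warnings;

# Hypothetical helper: gather the same statistics as words.pl from a list
# of lines already held in memory, searching for the word $sword.
sub text_stats {
    my ($sword, @lines) = @_;
    my %s = (lines => 0, blank => 0, words => 0, chars => 0, hits => 0);
    foreach my $line (@lines) {
        chomp(my $l = $line);               # work on a copy without the newline
        $s{lines}++;
        $s{blank}++ if length($l) == 0;     # zero length means a blank line
        foreach my $w (split ' ', $l) {     # split the line into words
            $s{hits}++ if $w eq $sword;
            $s{words}++;
            $s{chars} += length($w);
        }
    }
    # Guard the averages so that empty input does not divide by zero.
    $s{avg_words} = $s{lines} ? $s{words} / $s{lines} : 0;
    $s{avg_chars} = $s{words} ? $s{chars} / $s{words} : 0;
    return \%s;
}

my $st = text_stats('the', "the cat sat\n", "\n", "on the mat\n");
print "lines=$st->{lines} blank=$st->{blank} words=$st->{words} ",
      "chars=$st->{chars} hits=$st->{hits}\n";
```

As in words.pl, the comparison with eq is case-sensitive, so "The" and "the" are counted as different words.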
4. Command line processing and more text analysis
This program also reads from a text file and analyses the text. Proper command line handling is
now performed. The program converts all input text into lower case and strips off all the
punctuation marks in the text. The use of hashes is introduced.
> gedit wordcount.pl
The Code:
#!/usr/bin/env perl
#
#wordcount.pl FILE
#
#if no filename is given, print help and exit
unless (@ARGV)
{
    print "Usage is : wordcount.pl FILE\n";
    exit;
}
my $file = $ARGV[0];    #filename given in commandline
#open the mentioned filename, stopping if it cannot be read
open(FILE, "<", $file) or die "Cannot open $file: $!\n";
while(<FILE>)           #continue reading until the file ends
{
    chomp;
    tr/A-Z/a-z/;        #convert all upper case words to lower case
    tr/.,:;!?"(){}//d;  #remove some common punctuation symbols
    #We are creating a hash with the word as the key.
    #Each time a word is encountered, its hash is incremented by 1.
    #If the count for a word is 1, it is a new distinct word.
    #We keep track of the number of words parsed so far.
    #We also keep track of the no. of words of a particular length.
    foreach $wd (split)
    {
        $count{$wd}++;
        if ($count{$wd} == 1)
        {
            $dcount++;
        }
        $wcount++;
        $lcount{length($wd)}++;
    }
}
close(FILE);
#To print the distinct words and their frequency,
#we iterate over the hash containing the words and their count.
print "\nThe words and their frequency in the text is:\n";
foreach $w (sort keys %count)
{
    print "$w : $count{$w}\n";
}
#For the word length and frequency we use the word length hash.
#sort compares the keys as strings here, so 10 is listed before 2.
print "The word length and frequency in the given text is:\n";
foreach $w (sort keys %lcount)
{
    print "$w : $lcount{$w}\n";
}
print "There are $wcount words in the file.\n";
print "There are $dcount distinct words in the file.\n";
#Calculating the type-token ratio (as a percentage).
$ttratio = ($dcount/$wcount)*100;
print "The type-token ratio of the file is $ttratio.\n";
On execution:
> perl wordcount.pl constitution_preamble.txt
The words and their frequency in the text is:
1949 : 1
a : 1
adopt : 1
all : 2
among : 1
and : 8
assembly : 1
assuring : 1
belief : 1
citizens : 1
constituent : 1
constitute : 1
constitution : 1
day : 1
democratic : 1
dignity : 1
do : 1
economic : 1
enact : 1
equality : 1
expression : 1
faith : 1
fraternity : 1
give : 1
having : 1
hereby : 1
in : 1
india : 2
individual : 1
integrity : 1
into : 1
its : 1
justice : 1
liberty : 1
nation : 1
november : 1
of : 7
opportunity : 1
our : 1
ourselves : 1
people : 1
political : 1
promote : 1
republic : 1
resolved : 1
secular : 1
secure : 1
social : 1
socialist : 1
solemnly : 1
sovereign : 1
status : 1
the : 5
them : 1
this : 2
thought : 1
to : 5
twenty-sixth : 1
unity : 1
we : 1
worship : 1
The word length and frequency in the given text is:
1 : 1
10 : 5
11 : 2
12 : 2
2 : 15
3 : 18
4 : 6
5 : 7
6 : 8
7 : 7
8 : 9
9 : 5
There are 85 words in the file.
There are 61 distinct words in the file.
The type-token ratio of the file is 71.7647058823529.
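The word-length table above is listed in string order (1, 10, 11, 12, 2, ...) because sort compares its arguments as strings unless told otherwise. A minimal sketch of the difference follows. The hash holds a subset of the counts shown above, and the array names are chosen only for this illustration.

```perl
use strict;
use warnings;

# A subset of the word-length counts from the output above.
my %lcount = (1 => 1, 2 => 15, 3 => 18, 10 => 5, 11 => 2);

# Default sort compares keys as strings, so "10" sorts before "2".
my @as_strings = sort keys %lcount;

# The numeric comparison operator <=> orders the keys by value.
my @as_numbers = sort { $a <=> $b } keys %lcount;

print "string order : @as_strings\n";
print "numeric order: @as_numbers\n";
```

Replacing `sort keys %lcount` with the numeric form in wordcount.pl would list the lengths from 1 to 12 in ascending order.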
5. Text analysis with results output to another file
This program analyses the text of a file and outputs the results to another file after formatting
the output.
> gedit freqcount.pl
The Code:
#!/usr/bin/env perl
#
#freqcount.pl FILE
#
use strict; #using strict mode to help us find errors easily
#all variables being used are declared
my $file;
my $wd;
my %count;
my $w;
if (@ARGV)  #Check if the ARGV array is non-empty. This array is populated
            #with the arguments passed to the program.
{
    $file = $ARGV[0]; #First argument is the data file name.
}
else
{
    die "Usage : freqcount.pl FILE\n"; #Bail out if no data filename
                                       #is given
}
open(FILE, "<", $file) or die "Cannot open $file: $!\n";
#Open the file where the results will be written.
#If it exists it will be overwritten.
open(RESULTS, ">", "freqcount.txt") or die "Cannot open freqcount.txt: $!\n";
while(<FILE>)
{
    chomp;
    tr/A-Z/a-z/;
    tr/.,:;!?"(){}//d;
    foreach $wd (split)
    {
        $count{$wd}++;
    }
}
close(FILE);
print RESULTS "Word\t\tFrequency\n"; #Writing to newly opened file
foreach $w (sort by_number keys %count) #using our by_number function
{
    write(RESULTS);
}
close(RESULTS);
#Our sorting function.
#The first <=> compares $b with $a (rather than $a with $b), which sorts
#the result in descending order of frequency.
#The second <=> is used to sort the result based on the length of the
#word if the frequency is the same.
sub by_number
{
    $count{$b} <=> $count{$a} || length($b) <=> length($a);
}
#Formatting the results.
#A @ denotes the values to be printed.
#A < stands for left justify text in that position, > stands for right
#justify.
#The formatting ends with a final .
format RESULTS =
@<<<<<<<<<<<<<<< @>>
$w, $count{$w}
.
On Execution:
> perl freqcount.pl constitution_preamble.txt
> cat freqcount.txt
Word            Frequency
and                8
of                 7
the                5
to                 5
india              2
this               2
all                2
twenty-sixth       1
constitution       1
opportunity        1
constituent        1
individual         1
constitute         1
expression         1
democratic         1
fraternity         1
ourselves          1
integrity          1
socialist          1
political          1
sovereign          1
solemnly           1
assembly           1
citizens           1
resolved           1
november           1
economic           1
equality           1
assuring           1
republic           1
thought            1
dignity            1
worship            1
liberty            1
promote            1
justice            1
secular            1
secure             1
social             1
people             1
belief             1
nation             1
status             1
having             1
hereby             1
unity              1
among              1
faith              1
adopt              1
enact              1
give               1
them               1
1949               1
into               1
our                1
day                1
its                1
in                 1
do                 1
we                 1
a                  1
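In the listing above, words that share both frequency and length appear in whatever order the hash happens to return them, so two runs may differ. A sketch of one way to make the order reproducible is shown below. The third comparison with cmp is an addition for illustration and is not part of freqcount.pl, the sample hash is a small subset of the counts above, and printf is used instead of format only to keep the example short.

```perl
use strict;
use warnings;

# A small subset of the word counts from the output above.
my %count = (and => 8, of => 7, the => 5, to => 5, unity => 1, among => 1);

# Descending frequency, then descending word length, then alphabetical.
# The alphabetical tie-breaker is the added step.
sub by_freq_len_alpha {
    return $count{$b} <=> $count{$a}
        || length($b) <=> length($a)
        || $a cmp $b;
}

my @ordered = sort by_freq_len_alpha keys %count;

# Left-justify the word in 16 columns, right-justify the count in 3.
printf "%-16s %3d\n", $_, $count{$_} foreach @ordered;
```

With this comparator, "among" is always listed before "unity", whereas the two-level comparison leaves their relative order unspecified.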
6. Connecting to databases
This program shows how to connect to a database using the Perl DBI module. In place of a
database server, we use a csv file, which is accessed through the CSV database driver.
The database driver is a third party module which needs to be installed. The driver name is
DBD::CSV and it is installed by the following command.
> yum install perl-DBD-CSV
Once the driver is installed we can start. We will be using the Indian_capitals.csv from the
support files zip.
> gedit dbdcsv.pl
The Code:
#!/usr/bin/env perl
#
#dbdcsv.pl
#
# Shows connecting to databases using perl. The database here is a CSV file.
# The same principle applies to any database connection.
use strict;
use warnings;
use DBI;    #using perl database interface
# Connect to the database, (the directory containing our csv file)
my $dbh = DBI->connect("DBI:CSV:f_dir=.;csv_eol=\n;")
    or die "Cannot connect: $DBI::errstr\n";
# Associate our csv file with the table name 'Indian_capitals.csv' and
# manually declare names for each of the columns
$dbh->{'csv_tables'}->{'Indian_capitals.csv'} = {
    'col_names' => ["state", "admin_c","legis_c","year","old_c"]};
# Run a SQL command on the database which gives the list of all states
# whose capital city was changed.
# The SQL statement is prepared first before being executed.
# This provides us with a place to verify that the sql is proper and
# valid.
# A limited set of SQL operations can be performed here.
# If we are actually connecting to a database, all the SQL supported by
# the DB can be done here.
# This statement needs to be vetted for any SQL injection vulnerabilities
# in a real life scenario.
my $sth = $dbh->prepare("SELECT * FROM Indian_capitals.csv WHERE old_c
NOT LIKE '—'");
$sth->execute();
print ("\nThe list of states which had their capitals changed:\n");
# The output from the SQL statement is fetched and the results are
#printed.
while (my $row = $sth->fetchrow_hashref)
{
print($row->{'state'},"\n");
}
# The statement is closed.
$sth->finish();
On Execution:
> perl dbdcsv.pl
The list of states which had their capitals changed:
State_or_UT
Andhra Pradesh
Assam
Gujarat
Karnataka
Kerala
Madhya Pradesh
Orissa
Punjab
7. Charting data
This program reads rainfall data from a number of stations for a period of two months. It then
calculates how far the actual rainfall deviates from the normal, as a percentage, and plots a
bar chart based on the data. The data is once again in a csv file. This time the csv file is
read directly, i.e. it is not treated as a database. The file rainfall.csv is in the support files zip.
First the charting package has to be installed. The package is GD::Graph. The package for
reading the CSV file is Text::CSV.
> yum install perl-GDGraph perl-Text-CSV
> gedit rainfall.pl
The Code:
#!/usr/bin/env perl
#
#rainfall.pl
#
# This program reads data from a csv file containing a list of stations,
# actual rainfall and the average rainfall in each station over a period
# of two months from June to July 2011 for the states of Tamil Nadu and
# Pondicherry
use strict;
use warnings;
use Text::CSV;          #for reading the csv file
use GD::Graph::hbars;   #for drawing the horizontal bar graph
my $file = "rainfall.csv";
my $csv = Text::CSV->new();
open (CSV, "<", $file) or die $!;   #open the specified file for reading
my @columns;    #array holding the data from the csv file
my $station;    #individual station
my $rainfall;   #rainfall in each station
my @gprecip;    #array holding the charting data
my $err;
my $s;
my @station;    #array of stations
my @prainfall;  #array of positive rainfall
my @nrainfall;  #array of negative rainfall
my $nr;
my $ns;
my $my_graph;   #chart variable
while(<CSV>)
{
    next if ($. == 1);  #ignore the first line since it will have
                        #the headings
    if ($csv->parse($_))
    {
        @columns = $csv->fields();  #load values into array from
                                    #the file
        $station = $columns[0];
        $ns = push(@station, $station);     #build the array of
                                            #stations
        $rainfall = ((($columns[1]/$columns[2])*100) - 100 );
        #To draw the chart with different colors we need the
        #rainfall values in 2 arrays
        #One for the positive values of rainfall
        #Another for the negative values of rainfall
        if ($rainfall > 0)
        {
            $nr = push(@prainfall, int($rainfall));
            $nr = push(@nrainfall, undef);
        }
        else
        {
            $nr = push(@nrainfall, int($rainfall));
            $nr = push(@prainfall, undef);
        }
    }
    else
    {
        $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}
close CSV;
#combine the arrays to form the big array for GD::Graph
@gprecip = (\@station, \@prainfall, \@nrainfall);
#create the chart
$my_graph = GD::Graph::hbars->new(480, 640);
#Set parameters of the graph. See GD::Graph documentation for
#details.
$my_graph->set(
x_label => 'Station Name',
y_label => 'Percentage',
title => 'Rainfall % in TN and Pondicherry in June & July 2011',
y_max_value => 200,
y_min_value => -100,
overwrite => 1,
dclrs => [qw (green lred) ],
legend_placement => 'RB',
show_values => 1,
transparent=>0,
);
$my_graph->set_legend('Excess', 'Deficit');
#Write the bar graph into a png file
open(IMG, ">rainfall.png") or die $!;
binmode IMG;
print IMG $my_graph->plot(\@gprecip)->png;
close IMG;
On execution, a new file rainfall.png will be created in the current directory.
> perl rainfall.pl
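The percentage calculation and the split into two data series can be tried without the charting modules. The sketch below is an illustration only: the subroutine departure_percent is a hypothetical name, and the three readings are made-up values, not data from rainfall.csv.

```perl
use strict;
use warnings;

# Hypothetical helper: percentage departure of actual rainfall from normal,
# using the same formula as rainfall.pl.
sub departure_percent {
    my ($actual, $normal) = @_;
    return (($actual / $normal) * 100) - 100;
}

# Made-up readings as [station, actual, normal]; not taken from rainfall.csv.
my @readings = (['A', 150, 100], ['B', 50, 100], ['C', 120, 80]);

my (@station, @prainfall, @nrainfall);
foreach my $r (@readings) {
    my ($name, $actual, $normal) = @$r;
    my $d = departure_percent($actual, $normal);
    push @station, $name;
    # Each station gets a value in one series and undef in the other,
    # so that the two series can be drawn in different colours.
    if ($d > 0) {
        push @prainfall, int($d);
        push @nrainfall, undef;
    }
    else {
        push @nrainfall, int($d);
        push @prainfall, undef;
    }
}

foreach my $i (0 .. $#station) {
    my $p = defined $prainfall[$i] ? $prainfall[$i] : '-';
    my $n = defined $nrainfall[$i] ? $nrainfall[$i] : '-';
    print "$station[$i]: excess=$p deficit=$n\n";
}
```

A station with 150 units of rain against a normal of 100 gives +50, and one with 50 against 100 gives -50, so every station contributes to exactly one of the two series.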
with regards
prem sasi kumar arivukalanjiam