Thursday, 29 August 2013

Improve performance of Perl search file script

Improve performance of Perl search file script

I have recently noticed that a quick script I had written in Perl that was
designed to be used on sub 10MB files has been modified, re-tasked and
used in 40MB+ text files with significant performance issues in a batch
environment.
The jobs have been running for about 12 hours per run when encountering a
large text file and I am wondering how do I improve the perfomance of the
code? Should I slurp the file into memory and if I do it will break the
jobs reliance on the line numbers in the file. Any constructive thought
would be greatly appreciated, I know the job is looping through the file
too many times but how to reduce that?
#!/usr/bin/perl
use strict;
use warnings;
my $filename = "$ARGV[0]"; # This is needed for regular batch use
my $cancfile = "$ARGV[1]"; # This is needed for regular batch use
my @num =();
open(FILE, "<", "$filename") || error("Cannot open file ($!)");
while (<FILE>)
{
push (@num, $.) if (/^P\|/)
}
close FILE;
my $start;
my $end;
my $loop = scalar(@num);
my $counter =1;
my $test;
open (OUTCANC, ">>$cancfile") || error ("Could not open file: ($!)");
#Lets print out the letters minus the CANCEL letters
for ( 1 .. $loop )
{
$start = shift(@num) if ( ! $start );
$end = shift(@num);
my $next = $end;
$end--;
my $exclude = "FALSE";
open(FILE, "<", "$filename") || error("Cannot open file ($!)");
while (<FILE>)
{
my $line = $_;
$test = $. if ( eof );
if ( $. == $start && $line =~
/^P\|[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|1I\|IR\|/)
{
print OUTCANC "$line";
$exclude = "TRUECANC";
next;
}
if ( $. >= $start && $. <= $end && $exclude =~ "TRUECANC")
{
print OUTCANC "$line";
} elsif ( $. >= $start && $. <= $end && $exclude =~ "FALSE"){
print $_;
}
}
close FILE;
$end = ++$test if ( $end < $start );
$start = $next if ($next);
}
#Lets print the last letter in the file
my $exclude = "FALSE";
open(FILE, "<", "$filename") || error("Cannot open file ($!)");
while (<FILE>)
{
my $line = $_;
if ( $. == $start && $line =~
/^P\|[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|1I\|IR\|/)
{
$exclude = "TRUECANC";
next;
}
if ( $. >= $start && $. <= $end && $exclude =~ "TRUECANC")
{
print OUTCANC "$line";
} elsif ( $. >= $start && $. <= $end && $exclude =~ "FALSE"){
print $_;
}
}
close FILE;
close OUTCANC;
#----------------------------------------------------------------
sub message
{
my $m = shift or return;
print("$m\n");
}
sub error
{
my $e = shift || 'unknown error';
print("$0: $e\n");
exit 0;
}

No comments:

Post a Comment