Both Perl and Python let you download webpages from the internet. Downloading a webpage is only one of the things these languages can do; they can also speak other protocols, such as FTP and gopher, and connect to other services.
There are several things the program has to do: download the webpage, parse out the entries we care about, write them to a new file, and move that file into place.
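The download step by itself is tiny in Python. Here is a minimal sketch using the urllib module (this is just the first step pulled out for illustration; the url is the same LinuxToday headline file the full script below uses):

    #!/usr/bin/python
    import urllib

    ### Open a connection to the url and read the whole page into a string.
    Page = urllib.urlopen("http://linuxtoday.com/backend/lthead.txt")
    Contents = Page.read()
    Page.close()
    print Contents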
This article isn't going to be very long; most of the explanation is in the comments in the Python code.
Once the script has created lthead.html, you can pull it into a webpage with a server-side include like <!--#include virtual="/lthead.html" -->. Various programming languages (PHP, Perl ASP, Perl Mason, etc.) can also include files.
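For example, if you were serving the page from a Python CGI script instead of relying on server-side includes, you could inline the generated file yourself. This is a hypothetical sketch, not part of the original setup; the path matches the Download_Location used by the script below:

    #!/usr/bin/python
    ### Hypothetical CGI sketch: emit the generated headline file inline.
    print "Content-type: text/html"
    print
    print open("/tmp/lthead.html").read()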
It is assumed you are using a GNU/Linux operating system. Also, I was using Python 1.5.2, which is not the latest version. You might have to run "chmod 755 LinuxToday.py" on the script to make it executable.
#!/usr/bin/python

# One obvious thing to do is apply error checking: the url download
# must succeed, the download must contain at least one entry, and we
# must be able to create the new file. This will be done later.

### Import the web module, string module, regular expression module,
### and the os module.
import urllib, string, re, os

### Define the new webpage we create and where to get the info.
Download_Location = "/tmp/lthead.html"
Url = "http://linuxtoday.com/backend/lthead.txt"

#-----------------------------------------------------------
### Create a web object with the Url.
LinuxToday = urllib.urlopen(Url)
### Grab all the info into an array (if big, change to do one line at a time).
Text_Array = LinuxToday.readlines()

New_File = open(Download_Location + "_new", 'w')
New_File.write("<ul>\n")

### Record the number of valid entries.
Entry_No = 0
Entry_Valid = 0
### Set up the defaults.
Date = ""
Link = ""
Header = ""
Count = 0

### Create the pattern-matching expression.
Match = re.compile("^\&\&")

### Append && to make sure we parse the last entry.
Text_Array.append('&&')

### For each line, do the following.
for Line in Text_Array:
    ### If && exists, add the last entry and start from scratch.
    if Match.search(Line):
        ### If the current entry is valid and we have skipped the first one,
        ### write it out.
        if (Entry_No > 1) and (Entry_Valid > 0):
            ### One thing that Perl does better than Python is the print
            ### command. I don't like how Python prints. Perl's printing
            ### is more intuitive and easier to read (for me at least).
            New_File.write('<li> <a href="' + Link + '">' + Header +
                           '</a>. ' + Date + "</li>\n")
        ### Reset the values to nothing.
        Header = ""; Link = ""; Date = ""
        Entry_Valid = 0
        Count = 0

    ### Delete whitespace at the end of the line.
    Line = string.rstrip(Line)

    ### If Count is 1 this line is the header, 2 the link, 3 the date.
    if Count == 1:
        Header = Line
    elif Count == 2:
        Link = Line
    elif Count == 3:
        Date = Line

    ### If any field has been filled in, we may have a valid entry.
    if (Header != "") or (Link != "") or (Date != ""):
        Entry_No = Entry_No + 1
        Entry_Valid = 1

    ### Add one to Count.
    Count = Count + 1

New_File.write("</ul>\n")
### Close the file.
New_File.close()

### If we have valid entries, move the new file to the real location.
if Entry_No > 0:
    ### We could just do:
    ###   os.rename(Download_Location + "_new", Download_Location)
    ### But here's how to do it with an external command.
    Command = "mv " + Download_Location + "_new " + Download_Location
    os.system(Command)
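The error checking that the opening comment defers "for later" might look something like this sketch. In this era of Python, urllib.urlopen raises IOError when the download fails, so that is what we catch:

    #!/usr/bin/python
    import sys, urllib

    Url = "http://linuxtoday.com/backend/lthead.txt"

    ### Catch a failed download instead of crashing.
    try:
        LinuxToday = urllib.urlopen(Url)
        Text_Array = LinuxToday.readlines()
    except IOError:
        sys.stderr.write("Could not download " + Url + "\n")
        sys.exit(1)

    ### An empty download shouldn't replace the old file either.
    if len(Text_Array) == 0:
        sys.stderr.write("Downloaded page was empty.\n")
        sys.exit(1)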
### Crontab file
### Name the file "Crontab" and install it with "crontab Crontab".
### Download every two hours.
0 */2 * * * /www/Cron/LinuxToday.py >> /www/Cron/out 2>&1
#!/usr/bin/perl
# Copyrighted by Mark Nielsen, January 2001.
# Copyrighted under the GPL license.
# I am proud of this script. I wrote it from scratch,
# with only 2 minor errors when I first tested it.

system ("lynx --source http://www.linuxgazette.com/ftpfiles.txt > /tmp/List.txt");

### Open up the webpage we just downloaded and put it into an array.
open(FILE,'/tmp/List.txt');
my @Lines = <FILE>;
close FILE;

### Keep only the lines that contain the magic strings.
@Lines = grep(($_ =~ /lg\-issue/) || ($_ =~ /\.tar\.gz/), @Lines);

my @Numbers = ();
foreach my $Line (@Lines) {
    ## Throw away the stuff to the left.
    my ($Junk,$Good) = split(/lg\-issue/,$Line,2);
    ## Throw away the stuff to the right.
    ($Good,$Junk) = split(/\.tar\.gz/,$Good,2);
    ## If it is a valid number (greater than 0), save it.
    if ($Good > 0) {push (@Numbers,$Good);}
}

### Sort the numbers and pop off the highest.
@Numbers = sort {$a<=>$b} @Numbers;
my $Highest = pop @Numbers;

## Create the url we are going to download.
my $Url = "http://www.linuxgazette.com/issue$Highest/index.html";
## Download it.
system ("lynx --source $Url > /tmp/LG_index.html");

### Open up the index.
open(FILE,"/tmp/LG_index.html");
@Lines = <FILE>;
close FILE;

### Extract the part that lies between the beginning and end of the TOC.
my @TOC = ();
my $Count = 0;
my $Start = '<!-- *** BEGIN toc *** -->';
my $End = '<!-- *** END toc *** -->';
foreach my $Line (@Lines) {
    if ($Line =~ /\Q$End\E/) {$Count = 2;}
    if ($Count == 1) {push(@TOC, $Line);}
    if ($Line =~ /\Q$Start\E/) {$Count = 1;}
}

### Relink all the links to point to the Linux Gazette magazine.
my $Relink = "http://www.linuxgazette.com/issue$Highest/";
grep($_ =~ s/HREF\=\"/HREF\=\"$Relink/g, @TOC);

### Save the output.
open(FILE,">/tmp/TOC.html");
print FILE @TOC;
close FILE;
### Done!
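For comparison, the marker-scanning loop that pulls out the table of contents could be written in Python along these lines. This is my own rough sketch, not part of the original scripts; it reuses the same marker strings and temp file:

    #!/usr/bin/python
    import string

    ### Collect the lines that fall between the BEGIN and END markers.
    Start = '<!-- *** BEGIN toc *** -->'
    End = '<!-- *** END toc *** -->'
    TOC = []
    Count = 0
    for Line in open('/tmp/LG_index.html').readlines():
        if string.find(Line, End) != -1:
            Count = 2
        if Count == 1:
            TOC.append(Line)
        if string.find(Line, Start) != -1:
            Count = 1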
#!/usr/bin/perl
# Copyrighted by Mark Nielsen, January 2001.
# Copyrighted under the GPL license.

system ("lynx --source http://www.debian.org/News/weekly/index.html > /tmp/List2.txt");

### Open up the webpage we just downloaded and put it into an array.
open(FILE,'/tmp/List2.txt');
my @Lines = <FILE>;
close FILE;

### Extract the part that lies between the beginning and end of the
### list of recent issues.
my @TOC = ();
my $Count = 0;
my $Start = 'Recent issues of Debian Weekly News';
my $End = '</p>';
foreach my $Line (@Lines) {
    if (($Line =~ /\Q$End\E/i) && ($Count > 0)) {$Count = 2;}
    if ($Count == 1) {push(@TOC, $Line);}
    if ($Line =~ /^\Q$Start\E/i) {$Count = 1;}
}

### Relink all the links to point back to the Debian site.
my $Relink = "http://www.debian.org/News/weekly/";
grep($_ =~ s/HREF\=\"/HREF\=\"$Relink/ig, @TOC);
### Make the links open in a separate window.
grep($_ =~ s/\"\>/\" target=_external\>/ig, @TOC);

### Save the output.
open(FILE,">/tmp/D.html");
print FILE @TOC;
close FILE;
### Done!
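Similarly, the substitution that relinks the HREFs has a Python analogue in the re module. Again just a sketch of mine, with a made-up sample line standing in for the lines collected into @TOC:

    #!/usr/bin/python
    import re

    ### A sample line like the ones collected into @TOC above.
    TOC = ['<a HREF="2001/1/">January 2001</a>\n']

    ### Compile a case-insensitive pattern for the HREF attributes.
    Pattern = re.compile('HREF="', re.IGNORECASE)
    Relink = "http://www.debian.org/News/weekly/"

    ### Rewrite each line so relative links point back at the Debian site.
    for i in range(len(TOC)):
        TOC[i] = Pattern.sub('HREF="' + Relink, TOC[i])

    print TOC[0]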
Python rules as a programming language. I found the Python modules very easy to use, and Python's module for handling webpages seems easier to use than Perl's LWP module. Because of the many possibilities Python opens up, I plan on creating a Python script which will download many webpages at the same time using Python's threading capabilities.
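A first cut at that could look something like the sketch below. The url list and output file names are made up for illustration:

    #!/usr/bin/python
    import threading, urllib

    ### Download one url and save it under a local file name.
    def Fetch(Url, Filename):
        Contents = urllib.urlopen(Url).read()
        File = open(Filename, 'w')
        File.write(Contents)
        File.close()

    Urls = [("http://linuxtoday.com/backend/lthead.txt", "/tmp/lthead.txt"),
            ("http://www.linuxgazette.com/ftpfiles.txt", "/tmp/ftpfiles.txt")]

    ### Start one thread per url, then wait for them all to finish.
    Threads = []
    for Url, Filename in Urls:
        T = threading.Thread(target=Fetch, args=(Url, Filename))
        T.start()
        Threads.append(T)
    for T in Threads:
        T.join()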
Thanks to Mike Orr for suggestions on this article and others.