Python - Text tools replacement?
mart_man00
09-01-2003, 02:02 PM
I'm wondering how good Python is when it comes to text.
I'm really wondering if I can learn this one language instead of a lot of other commands (sed, awk, grep, perl) and still be able to get around.
What's your opinion? Could you come up with one or two small examples to illustrate it, please?
Thanks
mrBen
09-01-2003, 02:44 PM
Python has a regular expressions module (called re) which can do most things you would expect from regular expressions. I've also found its list and string handling very helpful.
Some examples (off the top of my head):
import re
temp = "A string variable with the name Bob in the middle"
if re.search("Bob", temp):
    print("Bob was here")
will show that Bob was here.
You can use proper regular expression metacharacters too, so:
re.search(r"^q[\D\d]*\.html$", filename)
will check whether filename begins with q, has any number of other characters (\D means any non-digit and \d means any digit - I think there is probably a better way to write this) and ends in .html
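For what it's worth, the same kind of pattern can be applied to each name in a list of filenames. This is only a minimal sketch: the list called filenames is hypothetical, and it relies on `.` matching any character except a newline, so `.*` stands in for `[\D\d]*` here.

```python
import re

# Hypothetical filenames, made up for illustration.
filenames = ["q1.html", "query.html", "index.html", "q2.htm", "qxhtml"]

# Raw string so the backslash reaches the regex engine; \. matches a literal dot.
pattern = re.compile(r"^q.*\.html$")

# Keep only the names that start with q and end in .html
matches = [name for name in filenames if pattern.search(name)]
print(matches)
```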
As for lists and strings, my standard example for help here is as follows:
configfile = open("config.ini", "r")
configs = configfile.readline()
# strip() drops the trailing newline; split on the first '=' only
configpair = configs.strip().split("=", 1)
This gives you a list that contains 2 objects - one is the variable name, the other is the value that should be assigned to it. I normally use a while loop to grab and then assign the variables until the end of the file, although there are probably neater ways of doing this.
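One of those neater ways might look like the sketch below. It assumes every useful line has the form name=value and skips blank lines and lines starting with # (both assumptions, as is the sample text). It reads from a list of lines so it runs without a real config.ini.

```python
# Sample lines standing in for the contents of config.ini (made up for illustration).
lines = [
    "# comment line\n",
    "name = Bob\n",
    "\n",
    "path=/tmp/data=old\n",
]

settings = {}
for line in lines:
    line = line.strip()
    # Skip blanks, comments, and anything without an '=' sign.
    if not line or line.startswith("#") or "=" not in line:
        continue
    # Split on the first '=' only, so values may contain '=' themselves.
    key, value = line.split("=", 1)
    settings[key.strip()] = value.strip()

print(settings)
```

With a real file, iterating over the open file object yields the lines in the same way.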
Hope that gives you a flavour - I'm not the best at Python, but I've found it easy to use and quick for developing complex utilities, and its string/text handling is very good, although perhaps not quite up to the standard of Perl yet.
Strike
09-03-2003, 06:03 PM
Originally posted by mrBen
As for lists and strings, my standard example for help here is as follows:
configfile = open("config.ini", "r")
configs = configfile.readline()
# strip() drops the trailing newline; split on the first '=' only
configpair = configs.strip().split("=", 1)
Of course, there is a ConfigParser module that does this in an even better way ;)
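A rough sketch of what that looks like. Assumptions: this uses the Python 3 spelling of the module (configparser; in Python 2 it is ConfigParser), and the section and option names are invented. Note that the parser expects a [section] header, which the config.ini quoted above may not have.

```python
import configparser

# Made-up config text; a real script would call parser.read("config.ini") instead.
sample = """
[general]
name = Bob
retries = 3
"""

parser = configparser.ConfigParser()
parser.read_string(sample)

# get() returns a string; getint() converts the value to an int.
name = parser.get("general", "name")
retries = parser.getint("general", "retries")
print(name, retries)
```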
Python's text/string handling is as good as any language, and better than most. It may not be as FAST as many of them, but as far as utility goes, Python is way up there.
Also, avoid regular expressions unless you must use them. Don't fall into the habit of using regexes wherever possible!
mrBen
09-04-2003, 03:20 AM
Originally posted by Strike
Of course, there is a ConfigParser module that does this in an even better way ;)
Ooops - didn't know about that. Must check it out (only really been using Python for a couple of months)
Python's text/string handling is as good as any language, and better than most. It may not be as FAST as many of them, but as far as utility goes, Python is way up there.
As a test for a programme I wrote that searches a directory of HTML documents for given text, I copied the text of War and Peace into 2000 HTML documents of ~1500 characters each. These 2000 documents get loaded into a list, which is then searched - even if you search for a single letter (like e) it still comes in well under a minute. Most full-word searches come in at about 2 seconds. So it's not too bad for speed, IMHO, although I don't have much experience of Perl, which I guess would be the nearest competitor.
Also, avoid regular expressions unless you must use them. Don't fall into the habit of using regexes wherever possible!
Could you advise on the best (quickest) way to search for a word of text within a bigger bunch of text? Or is a regex the way to go for that? Currently I use a list of search terms, and a list containing all the text I want to search, and I regex for it, something like this:
# The list search contains a number of search terms
# The list contents contains one whole HTML file per list item
for loop in contents:
    found = 0  # count the matches for this file only
    for loop2 in search:
        if re.search(loop2, loop, re.IGNORECASE):
            found = found + 1
    if found == len(search):
        print("I found all search terms")
Looking at the code, I'm guessing it might be quicker at least to first join up my words with AND, and then search for that as a regex rather than looping through the entire thing. Are there any other speed savings I could make?
Strike
09-05-2003, 01:07 PM
Could you advise on the best (quickest) way to search for a word of text within a bigger bunch of text?
Well, part of where we differ is that I don't always think of quickest as the best :) Regexes may be the fastest way of doing this, but I'm not sure. You can make your own tests, if you're curious.
What I mostly dislike is when people use regex for things that are easily done with string methods. Like they use it to split things on certain fields when you can just use .split(), or they use it to replace text when you can just use .replace() ... that sort of thing.
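To make that concrete, here is a small sketch (the sample strings are made up) showing that for fixed separators and fixed replacement text the plain string methods give the same result as their re counterparts:

```python
import re

record = "alice:x:1001:/home/alice"

# Splitting on a fixed separator: the string method and the regex agree.
fields_plain = record.split(":")
fields_regex = re.split(":", record)

# Replacing fixed text: same story.
text = "colour and colour"
replaced_plain = text.replace("colour", "color")
replaced_regex = re.sub("colour", "color", text)

print(fields_plain == fields_regex, replaced_plain == replaced_regex)
```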
mrBen
09-08-2003, 03:08 AM
Originally posted by Strike
Well, part of where we differ is that I don't always think of quickest as the best :)
:) Agreed. But for the case in point speed was of the essence, as I was searching through 2000+ HTML files, and any saving, however tiny, on the time taken to search one file was going to multiply up into a big saving.
Regexes may be the fastest way of doing this, but I'm not sure. You can make your own tests, if you're curious.
I did do some (unscientific) tests to see if it was quicker to search for x AND y AND z, or to use a loop to search for them individually, and for 2 or 3 search terms it seemed to be slightly quicker to use a loop than to use an re. In the end the only change I made to my original code was a break in the loop if any of the terms was not found (given that it was an AND search, rather than an OR).
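The early exit described above might look something like this sketch. The names all_terms_found, contents and search are illustrative, and it uses a plain case-insensitive substring test rather than a regex, which is a substitution made here and not necessarily what the original programme did.

```python
def all_terms_found(text, terms):
    """Return True only if every term occurs in text, ignoring case."""
    lowered = text.lower()
    for term in terms:
        if term.lower() not in lowered:
            # AND search: one miss is enough, so stop checking this file.
            return False
    return True

# Made-up stand-ins for the file contents and the search terms.
contents = ["<html>War and Peace</html>", "<html>Only peace here</html>"]
search = ["war", "peace"]

# Indexes of the files that contain every search term.
hits = [i for i, text in enumerate(contents) if all_terms_found(text, search)]
print(hits)
```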
What I mostly dislike is when people use regex for things that are easily done with string methods. Like they use it to split things on certain fields when you can just use .split(), or they use it to replace text when you can just use .replace() ... that sort of thing.
Ah. Understood.
Strike
09-08-2003, 11:36 AM
Out of curiosity, did you use re.search() and the like, or did you compile a regex object in your tests? If you didn't do an re.compile() to create a regex object, I'd definitely try your tests again with that setup.
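For anyone wanting to rerun that comparison, a sketch along these lines would do. The text and search term are invented, and timeit is just one reasonable way to measure. The actual numbers depend on the machine and Python version, and since the re module caches recently used patterns, the gap may turn out to be small.

```python
import re
import timeit

# Made-up text and term; substitute real file contents to get meaningful timings.
text = "Well, Prince, so Genoa and Lucca are now just family estates. " * 200
term = "family"

# Compile once up front, then reuse the pattern object for every search.
compiled = re.compile(term, re.IGNORECASE)

def with_module_function():
    return re.search(term, text, re.IGNORECASE) is not None

def with_compiled_pattern():
    return compiled.search(text) is not None

t_module = timeit.timeit(with_module_function, number=1000)
t_compiled = timeit.timeit(with_compiled_pattern, number=1000)
print("re.search:        %.4f s" % t_module)
print("compiled.search:  %.4f s" % t_compiled)
```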