ScRapZ_1
08-11-2002, 02:52 PM
Basically, I want to assign a string of text to a variable. Sounds easy, but there's kind of a catch.
I've made a script that retrieves data from a webpage and displays it. It downloads the HTML file, parses the relevant data into a local file, then I use readline() to assign it to a string. I do this by using wget to get the HTML file, then I use grep and cut to get the text I want and echo the result onto the end of a file. What I want to do is skip that grep and cut bit, and parse the HTML file from within the program itself, not by using os.system() calls. The process I'm using at the moment is long and messy... it works, but I just want to clean it up and make it look better. I'm also hoping to move this program to Windows, so things like grep won't work.
Any ideas anyone?
TTFN,
Scrapz :p
Regular expressions? A while back I made the following tiny function, that works a bit like grep, hence the name:
import re

def egrep(l, ere):
    for s in l:
        m = re.search(ere, s)
        if m:
            return m
    return None
It should be rather obvious what it does. HTH.
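For instance, here is a minimal sketch of how a function like that might be used on the lines of a saved page. The sample lines and the pattern are made up for illustration; a file object from open() could be passed in the same way as the list.

```python
import re

def egrep(lines, pattern):
    """Return the first match object for pattern in lines, or None."""
    for line in lines:
        m = re.search(pattern, line)
        if m:
            return m
    return None

# Hypothetical page content, one string per line.
lines = [
    "<html><body>",
    "<td>Temperature: 21C</td>",
    "</body></html>",
]

# The parentheses capture just the part we want to keep.
m = egrep(lines, r"Temperature: (\d+)C")
temperature = m.group(1) if m else None
print(temperature)
```

Because the match object is returned rather than the whole line, the group() call does the job that cut was doing in the shell version.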
Scrapz, you know moobot, right? I've written a number of modules for moobot which do exactly what you want to do. In fact, I suggest that you don't even use wget. It's very easy to obtain an HTML page via Python's httplib.
Click here (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/moobot/moobot/weblookup.py?rev=1.27&content-type=text/vnd.viewcvs-markup) for moobot's weblookup stuff.
Here is one specific example:
# this class implements the cfactive command for moobot
# it accesses the coderforums.net main page and parses
# out the relevant information
class cfactive(MooBotModule):
    """ returns a list of currently active users at coderforums.net """
    def __init__(self):
        self.regex = "^cfactive"

    def handler(self, **args):
        # (1)
        import httplib, re, sys
        from irclib import Event
        # (2)
        conn = httplib.HTTPConnection('www.coderforums.net')
        conn.request('GET', "")
        response = conn.getresponse()
        if response.status != 200:
            # response.status is an int, so convert it before concatenating
            msg = "bad response from coderforums: " + str(response.status)
            return Event("privmsg", "", self.return_to_sender(args), [msg])
        data = response.read()
        # strip out anything that looks like an HTML tag
        data = re.sub('<.*?>', '', data)
        # (3)
        match = re.search('on the boards.*?Number of Active', data, re.DOTALL)
        if match is None:
            msg = "data not found; gosh that sucks"
            return Event("privmsg", "", self.return_to_sender(args), [msg])
        # keep only the matched region, then tidy it up for display
        data = data[match.start():match.end()]
        data = re.sub(r'\s', '', data)
        data = re.sub(r'(ontheboards\.)|(NumberofActive)', '', data)
        data = re.sub(',', ', ', data)
        return Event("privmsg", "", self.return_to_sender(args), [data])
(1) You want to use the httplib (http://python.org/doc/current/lib/module-httplib.html) and re (http://python.org/doc/current/lib/module-re.html) modules. RE stands for regular expression; dunno how familiar you are with them, but they can be tricky for newcomers. httplib is very easy to understand.
(2) Here's where I use httplib. Create a connection, make a request for a page (in this case "", since I want the front page). Get the response, and check the status; and finally read the response into a variable. Now data is one big string containing all the HTML from the page.
(3) Now I use the re module to locate the exact place in the page where I want to be. I do this with re.search; that's a very basic regular expression I'm using there. I use the regular expression to crop off all the extra info on the page, and use the re.sub function to clean it up a bit before displaying the information. (NOTE: this code may not work anymore, since CF.net was just redesigned.)
It seems this code has all the functionality that you would need to mimic, and it will work on either Windows or Linux, no problem.
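Pulled out of the bot, the same fetch-then-parse steps might look something like the sketch below. This is not moobot code: the names fetch_page and extract_between and the sample HTML are made up for illustration, and the import falls back to http.client, which is what httplib is called in Python 3. It also collapses whitespace to single spaces, which is a slightly different cleanup from the one in the class above.

```python
import re

try:
    import httplib                    # Python 2
except ImportError:
    import http.client as httplib     # same module under its Python 3 name


def fetch_page(host, path="/"):
    """Download a page and return its body as text, or None on a non-200 status."""
    conn = httplib.HTTPConnection(host)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        if response.status != 200:
            return None
        body = response.read()
    finally:
        conn.close()
    # read() returns a byte string; decode it so the regexes below see text.
    if isinstance(body, bytes):
        body = body.decode("utf-8", "replace")
    return body


def extract_between(html, start, end):
    """Strip tags, then return the text between the start and end markers."""
    text = re.sub(r"<.*?>", "", html)   # drop anything that looks like a tag
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        return None
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", match.group(1)).strip()


# Hypothetical HTML standing in for a downloaded page.
sample = "<p>on the boards.</p>\n<b>alice</b>,\n<b>bob</b>\n<p>Number of Active</p>"
print(extract_between(sample, "on the boards.", "Number of Active"))
```

Only the parsing half is exercised here, since fetch_page needs a network connection. In real use the string returned by fetch_page would be passed to extract_between in place of sample.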
ScRapZ_1
08-14-2002, 06:29 AM
:eek:
My god, I love you kmj. No, really I do... I'm going to go do that now :D
TTFN,
Scrapz :p