html source, email separator
Sensei
04-22-2001, 04:17 PM
Looking for a script that snarfs/separates an email address, or possibly multiple email addys, from html source.
See, rather than editing the source or copying and pasting, one would just do:
./foo.pl > foo.txt
Is this easy to do?
jemfinch
04-22-2001, 05:08 PM
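#!/usr/bin/env python
import re
import sys
import fileinput
# One or more address characters, an @, then one or more address characters.
emailre = re.compile(r'([0-9A-Za-z\._-]+@[0-9A-Za-z\._-]+)')
# Read every file named on the command line, line by line.
FileInput = fileinput.input(sys.argv[1:])
for line in FileInput:
    for email in re.findall(emailre, line):
        print(email)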
Usage: snarf.py <file1> <file2> ...
Jeremy
[ 22 April 2001: Message edited by: Sensei ]
Sensei
04-22-2001, 06:06 PM
jemfinch,
the script works great! Thanks!
maybe you might want to add it to linuxducks code snippet database?
[ 22 April 2001: Message edited by: Sensei ]
Ben Briggs
04-22-2001, 08:43 PM
Originally posted by Sensei:
"the script works great! Thanks!"
Do you expect anything less from jemfinch and Python? :)
"maybe you might want to add it to linuxducks code snippet database?"
It's nice to see that the management pays attention to its visitors :).
Mikey123
04-22-2001, 09:30 PM
For those in a perl'ish mood:
command line
perl -ne 'while(/([0-9A-z\._]+\@[0-9A-z\._]+)/g){print $1,"\n";}' filetosnarf.html > saveas.txt
jemfinch
04-22-2001, 09:58 PM
Originally posted by Mikey123:
perl -ne 'while(/([0-9A-z\._]+\@[0-9A-z\._]+)/g){print $1,"\n";}' filetosnarf.html > saveas.txt
'Tis only a minor quibble, but it's much safer to use "A-Za-z" than "A-z". Some character sets may have intervening letters between the sets of capital and lowercase letters.
Jeremy
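To make the quibble concrete, here is a small Python sketch (Python rather than Perl, and the sample string is made up for illustration). In ASCII the range A-z also covers the six punctuation characters that sit between "Z" and "a", so the looser class can pull brackets and the like into a match:

```python
import re

# Hypothetical sample text, not taken from the thread.
sample = "contact: a[b]c@example.com"

# A-z covers everything from 'A' to 'z' in ASCII, including [ \ ] ^ _ `
loose = re.compile(r'[0-9A-z._]+@[0-9A-z._]+')
# A-Za-z covers only the two letter ranges.
strict = re.compile(r'[0-9A-Za-z._]+@[0-9A-Za-z._]+')

loose_hits = loose.findall(sample)
strict_hits = strict.findall(sample)

print(loose_hits)   # the brackets are swallowed into the match
print(strict_hits)  # the match starts after the last bracket
```

Neither result is a validated address; the sketch only shows how the two character classes differ.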
Sensei
04-25-2001, 01:56 PM
I priv'd messaged jemfinch, but I think others may benefit or have the answer...
the above script from jemfinch works 'impressively' well, but I was wondering what to add to it to make it strip duplicate emails from the destination file.
YaRness
04-25-2001, 02:14 PM
here you go sensei:
#!/usr/local/bin/perl -w
use strict;
use warnings;
open (FILE, $ARGV[0]) or die "$0: something went wrong, stopped";
my %hash;
while (<FILE>)
{
    while (/([0-9A-Za-z\._]+\@[0-9A-Za-z\._]+)/g)
    {
        $hash{$1} += 1;
    }
}
close FILE;
print "$_\n" foreach (sort keys %hash);
save that as "foo.pl" or whatever, and do "./foo.pl snarf.html > output" (<edit>don't forget to chmod it)
it's not very well written (doesn't check for mangled or insidious file names), but it gets the job done.
it will print them in some kinda alphabetical like order. you can also change "print "$_\n" foreach (sort keys %hash);" to "print "$_ $hash{$_}\n" foreach (sort keys %hash);" and you will get the number of times each email was found.
<edit> fixed some typos and stuff
<edit2> and some other stuff :o can't remember which built-in is most appropriate for the die line. so i took it out
<edit3> and changed the A-z to A-Za-z (copied the wrong regexp).
[ 25 April 2001: Message edited by: YaRness ]
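For comparison, the same count-per-address idea could be sketched in Python roughly like this. This is only an illustration under assumptions: the function name count_emails and the sample lines are made up, and collections.Counter stands in for the Perl hash.

```python
import re
from collections import Counter

EMAIL_RE = re.compile(r'[0-9A-Za-z._]+@[0-9A-Za-z._]+')

def count_emails(lines):
    """Return a Counter mapping each address to how often it was found."""
    counts = Counter()
    for line in lines:
        counts.update(EMAIL_RE.findall(line))
    return counts

# Hypothetical input; in practice this would be an open file object.
sample_lines = [
    '<a href="mailto:bob@example.com">bob@example.com</a>',
    'alice@example.org wrote to bob@example.com',
]

counts = count_emails(sample_lines)
# Sorted output, like "sort keys %hash", with the count after each address.
for address in sorted(counts):
    print(address, counts[address])
```

Dropping the count from the print call gives the plain unique list.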
Sensei
04-25-2001, 02:21 PM
hmm doesn't seem to be working...
I get this:
bash: ./foo.pl: No such file or directory
when I run:
./foo.pl snarfed.html > newsnarf.html
YaRness
04-25-2001, 02:26 PM
got me, it worked when i did "perl foo.pl yar.txt" here.
make sure it's not a SBKC (Short Between the Keyboard and Chair.. ie make sure yer files are there and all that, and you named the script what you think you named it). doesn't seem like a script problem.
<edit>also, make sure the shebang line reflects where your perl proggie is located (you can also run it like "perl foo.pl snarf.html > output"). that's all i can think of
[ 25 April 2001: Message edited by: YaRness ]
Sensei
04-25-2001, 02:48 PM
you were right, totally my oversight... it was the shebang line DOH! :eek:
Sensei
04-25-2001, 02:48 PM
PS it seems to work very well, im gonna run it on some files and see how it does.
YaRness
04-25-2001, 03:14 PM
rawkinroll.
TheLinuxDuck
04-25-2001, 03:49 PM
#!/usr/bin/perl -w
while(<> ) { while (/([0-9A-Za-z\._]+\@[0-9A-Za-z\._]+)/g) { $emails{$1}=0; } }
for(keys %emails) { print "$_\n"; };
Same regexp as YaR's, but real simple. ()^:=
jemfinch
04-25-2001, 04:07 PM
Oh, you privmsg'ed me? I never see those things :)
#!/usr/bin/env python
import re
import sys
import fileinput
emailre = re.compile(r'([0-9A-Za-z\._-]+@[0-9A-Za-z\._-]+)')
# Dictionary keys are unique, so storing each address as a key drops duplicates.
emails = {}
FileInput = fileinput.input(sys.argv[1:])
for line in FileInput:
    for email in re.findall(emailre, line):
        emails[email] = 1
for email in emails.keys():
    print(email)
That'll print unique emails only.
Mikey123
04-25-2001, 04:57 PM
Just to add my two cents and to show our motto 'tmtowtdi'
perl -ne 'while(/([0-9A-Za-z\._]+\@[0-9A-Za-z\._]+)/g){print $1,"\n" unless $seen{$1}++;}' filetosnarf.html > saveas.txt
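The "print unless already seen" trick in that one-liner has a fairly direct Python counterpart. A rough sketch, assuming a made-up helper name unique_in_order and made-up sample lines; a set plays the role of %seen:

```python
import re

EMAIL_RE = re.compile(r'[0-9A-Za-z._]+@[0-9A-Za-z._]+')

def unique_in_order(lines):
    """Yield each address the first time it appears, skipping repeats."""
    seen = set()
    for line in lines:
        for address in EMAIL_RE.findall(line):
            if address not in seen:
                seen.add(address)
                yield address

# Hypothetical input lines.
sample_lines = ["b@x.org a@x.org", "b@x.org c@x.org"]
result = list(unique_in_order(sample_lines))
print(result)
```

Because addresses are emitted as they are first met, the output keeps the order of the input instead of being sorted.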
Sensei
04-25-2001, 05:17 PM
wow the new python script hardly pushes the disk access rate up, where the perl script pushes it to 100%. :cool:
Mikey123
04-25-2001, 05:39 PM
Interesting.
I would love to hear what the difference is as I know squat about python. Anybody???
TheLinuxDuck
04-25-2001, 05:39 PM
But mine handles unlimited files (per OS), is MUCH MUCH smaller, and also only returns unique email addresses....
(:
Mikey123
04-25-2001, 05:58 PM
Originally posted by TheLinuxDuck:
#!/usr/bin/perl -w
while(<> ) { while (/([0-9A-Za-z\._]+\@[0-9A-Za-z\._]+)/g) {print $1,"\n" unless $emails{$1}++; } }
to make it even shorter and skip an unnecessary loop
OK I know we're getting repetitive but i just love strippin the simple stuff to the bone.
TheLinuxDuck
04-25-2001, 06:21 PM
Originally posted by Mikey123:
#!/usr/bin/perl -w
while(<> ) { while (/([0-9A-Za-z\._]+\@[0-9A-Za-z\._]+)/g) {print $1,"\n" unless $emails{$1}++; } }
to make it even shorter and skip an unnecessary loop
OK I know we're getting repetitive but i just love strippin the simple stuff to the bone.
Qool! I was hoping there was some way to get rid of the extra loop. (^:=
jemfinch
04-25-2001, 06:21 PM
Originally posted by Sensei:
wow the new python script hardly pushes the disk access rate up, where the perl script pushes it to 100%. :cool:
Is that a good thing or a bad thing?
Also, which did you run first? If the dataset is small enough, then the one you run second will be reading the buffer cache, not the disk itself. If you want an accurate test without giving one an advantage, do a "find /" between runs.
Jeremy
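One way to keep run order from skewing a comparison is sketched below. Note this is a different technique from the "find /" suggestion: instead of pushing the file out of the cache between runs, it does one untimed warm-up read so that every timed run sees a warm cache. The helper names, the temporary file and its contents are all assumptions made for the example.

```python
import os
import re
import tempfile
import time

EMAIL_RE = re.compile(r'[0-9A-Za-z._]+@[0-9A-Za-z._]+')

def extract(path):
    """Return the set of addresses found in the file at path."""
    with open(path) as handle:
        return {m for line in handle for m in EMAIL_RE.findall(line)}

def timed_runs(func, path, repeats=3):
    """Time func(path) several times after one untimed warm-up call."""
    func(path)  # warm-up: the file is likely cached by the OS afterwards
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(path)
        timings.append(time.perf_counter() - start)
    return timings

# Build a throwaway input file (hypothetical content).
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as tmp:
    tmp.write('<a href="mailto:someone@example.com">mail</a>\n' * 1000)
    sample_path = tmp.name

found = extract(sample_path)
timings = timed_runs(extract, sample_path)
os.remove(sample_path)

print(found)
print("best of %d runs: %.6f s" % (len(timings), min(timings)))
```

This only measures the extraction itself inside one interpreter; it says nothing about interpreter start-up time, which would need timing the whole command from outside.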
jemfinch
04-25-2001, 06:56 PM
Originally posted by Mikey123:
#!/usr/bin/perl -w
while(<> ) { while (/([0-9A-Za-z\._]+\@[0-9A-Za-z\._]+)/g) {print $1,"\n" unless $emails{$1}++; } }
OK I know we're getting repetitive but i just love strippin the simple stuff to the bone.
But then it's not so simple, is it?
Jeremy
Sensei
04-25-2001, 10:59 PM
Originally posted by jemfinch:
Is that a good thing or a bad thing?
Also, which did you run first? If the dataset is small enough, then the one you run second will be reading the buffer cache, not the disk itself. If you want an accurate test without giving one an advantage, do a "find /" between runs.
Jeremy
That's good.
:rolleyes:
YaRness
04-26-2001, 08:15 AM
here's another go, with less code and no hash:
#!/usr/local/bin/perl
undef $/, $_ = <>;
(print "$1\n" and s/$1//g) while (/([0-9A-Za-z\._]+\@[0-9A-Za-z\._]+)/)
put that in foo.pl and do "./foo.pl < snarf.html > output"
[ 26 April 2001: Message edited by: YaRness ]
YaRness
04-26-2001, 08:31 AM
i wonder if you could just change the '[0-9A-Za-z\._]+' to '[^ <>"]+' (that's whitespace, both angle brackets, and a double quote mark) or something. that's all i can think that might be around an email address in html. but i didn't think about it very hard.
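A quick way to see what that change would do is to try both character classes on the same text. The sketch below is in Python, with a made-up sample line, and uses \s for "whitespace" in the wider class:

```python
import re

# Hypothetical line of html, not from the thread.
sample = '<a href="mailto:joe+news@example.com">Joe</a>'

# The class used in the scripts above.
narrow = re.compile(r'[0-9A-Za-z._]+@[0-9A-Za-z._]+')
# Anything except whitespace, angle brackets and double quotes.
wide = re.compile(r'[^\s<>"]+@[^\s<>"]+')

narrow_hits = narrow.findall(sample)
wide_hits = wide.findall(sample)

print(narrow_hits)  # stops at the "+", so only part of the address is kept
print(wide_hits)    # keeps the "+", but the "mailto:" prefix comes along too
```

So the wider class trades one kind of miss for another: it no longer cuts addresses at characters like "+", but it needs some follow-up cleaning for prefixes such as "mailto:".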
jemfinch
04-26-2001, 09:40 AM
Originally posted by Sensei:
That's good.
:rolleyes:
So you're telling me that the perl version went significantly faster than the python version?
Jeremy
YaRness
04-26-2001, 09:54 AM
i still think you mighta been right about the buffer thing. i don't have a python interpreter handy (not to mention i'm not sitting in linux), but a more proper comparison would be interesting. the only bottleneck for so simple a program seems to me would be loading the interpreter. if you get a bigger file, then there might be something else.
Mikey123
04-26-2001, 10:04 AM
Originally posted by YaRness:
here's another go, with less code and no hash:
#!/usr/local/bin/perl
undef $/, $_ = <>;
(print "$1\n" and s/$1//g) while (/([0-9A-Za-z\._]+\@[0-9A-Za-z\._]+)/)
put that in foo.pl and do "./foo.pl < snarf.html > output"
I think we will find that the hash version is more efficient than recompiling another regex for each match. Also likely want to add the 'o' option to the regex as well. I may run a few benchmarks of my own if I have time later.
Thanks for the fun guys :)
Sensei
04-26-2001, 10:32 AM
no I'm saying the python version went faster, and actually used a lot of processor and not a lot of disk access, with the perl script it was both.
jcrowe
04-26-2001, 10:37 AM
I think it would be interesting to use that random email generator that was posted a while back and do a comparison to see which is faster.
jcrowe
YaRness
04-26-2001, 10:40 AM
Originally posted by Mikey123:
I think we will find that the hash version is more efficient than recompiling another regex for each match. Also likely want to add the 'o' option to the regex as well. I may run a few benchmarks of my own if I have time later.
Thanks for the fun guys :)
processor time VS memory.
both kinda small potatoes at that. i just like writing code in different ways.
i wonder if doing two iterations, one to snag emails (reading data from file) and one to ditch repeats (messing with stuff in memory), would be significantly different in speed than ditching repeats on-the-fly.
coding is fun.
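The two strategies can be put side by side in a small Python sketch (function names and sample lines are made up for the example). It does not answer the speed question, but it shows the main structural difference: the two-pass version holds every occurrence in memory before trimming, while the on-the-fly version only ever holds the distinct addresses.

```python
import re

EMAIL_RE = re.compile(r'[0-9A-Za-z._]+@[0-9A-Za-z._]+')

def two_pass(lines):
    """Pass 1 collects every occurrence; pass 2 removes the repeats."""
    collected = []
    for line in lines:
        collected.extend(EMAIL_RE.findall(line))
    return sorted(set(collected))

def on_the_fly(lines):
    """Repeats are dropped as soon as they are read."""
    seen = set()
    for line in lines:
        seen.update(EMAIL_RE.findall(line))
    return sorted(seen)

# Hypothetical input lines.
sample_lines = ["a@x.org b@x.org a@x.org", "b@x.org c@x.org"]
first = two_pass(sample_lines)
second = on_the_fly(sample_lines)
print(first == second, first)
```

Timing either function on a large file (for instance with the timeit module) would be the way to check whether the difference is measurable.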
Sensei
04-26-2001, 10:50 AM
This has definitely been eye opening. Very impressed that you guys can whip out code like that at the drop of a hat. :cool:
TheLinuxDuck
04-26-2001, 10:55 AM
Most of these folks whip stuff like this out every day. (^:= I think that's why I enjoy the programming forum the most of the forums here..
YaRness
04-26-2001, 11:02 AM
Originally posted by TheLinuxDuck:
Most of these folks whip stuff like this out every day. (^:= I think that's why I enjoy the programming forum the most of the forums here..
yeah, it impresses my boss too. same as knowing anything though, enough experience doing a thing and you can make any task seem to be easy and stuff (whether or not it actually is).
hopefully soon i'll have a boss that actually understands it as well... then i'll hafta work harder.
jemfinch
04-26-2001, 02:33 PM
Originally posted by Mikey123:
#!/usr/local/bin/perl
undef $/, $_ = <>;
(print "$1\n" and s/$1//g) while (/([0-9A-Za-z\._]+\@[0-9A-Za-z\._]+)/)
I think we will find that the hash version is more efficient than recompiling another regex for each match.
Perl and Python both cache recently used regular expressions, so you're not really recompiling the regexp every time through.
Just as a note, though, perl has qr// for compiling a regular expression. It returns a compiled regexp that can be used anywhere a normal one could be used.
Jeremy
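On the Python side, the closest thing to qr// is re.compile, which the scripts earlier in the thread already use. A minimal sketch of the two call styles (the sample line is made up): passing the pattern string to a module-level function relies on the re module's internal cache of recently used patterns, while calling a method on the compiled object skips that lookup.

```python
import re

PATTERN = r'[0-9A-Za-z._]+@[0-9A-Za-z._]+'

# Compile once, up front, and keep the pattern object around.
compiled = re.compile(PATTERN)

# Hypothetical input line.
line = "write to alice@example.org or bob@example.com"

# Module-level call: the string is looked up in re's cache on each call.
via_module = re.findall(PATTERN, line)
# Method call on the compiled object: no lookup needed.
via_object = compiled.findall(line)

print(via_module)
print(via_object)
```

Both calls return the same matches; the difference is only in where the compiled pattern comes from.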