C socket problem


monkeyboi
12-15-2002, 05:10 AM
i'm working on a very small text-based web browser which only supports a few html tags. anyway, the problem has nothing to do with the html parsing but with getting the html source....

so i create the socket,
connect to some site and
then send the command to it

something like this..
ex. GET / HTTP/1.0\n\n

the problem is the receiving.

recv(sockdesc, buf,bufsize, 0);

i understand that the server doesn't send out the html all at once but rather breaks it into parts....

the code above will get the header and store it in the buf variable.... but i want the whole thing so i use a while loop until it receives nothing.

while (recv(sockdesc,buf,bufsize,0)>0)
{
strcat(totalbuf, buf);
}

i thought this would work fine... but i was wrong.. it only gets 30% to 80% of the actual html...

y? can any1 show me the correct way to receive the whole html content of a page?

btw i tried with over 10 different sites including yahoo.com and hotmail.com

thanks in advance!

monkeyboi
12-15-2002, 07:51 PM
please any1???
i'm really stuck
thx

Spawn913
12-16-2002, 12:09 AM
some questions... in the lines:

while (recv(sockdesc,buf,bufsize,0)>0)
{
strcat(totalbuf, buf);
}

wouldn't this block? or did you set your socket to nonblocking mode?

I haven't had much experience on the HTML protocol, but have worked on some socket programming. Do you know beforehand the length of the HTML packet you should be receiving?

monkeyboi
12-16-2002, 01:07 AM
yah it's in the default block mode....

and it depends on whether the server sends the Content-Length in its header, some servers don't send the content size, so i have to assume that none of them do.....

for example if i request the file /index.html from some webserver... the server then breaks that file into different sections, stores them in a buffer and sends them one by one.... that means the client has to execute the recv() function many times to fully retrieve all the contents.... but it doesn't work for some reason...

i also tried parsin the content-length in the first packet (the header) to get the total length of the whole contents....

jux say the html file i'm about to receive is 12088 bytes
here is part of the code

int a, b,c;

while(c < 12088)
{

a = recv(sockdesc,buf,bufsize,0);
if (a <0) return(-1);

c +=a;
}

this still doesn't give me the whole html content....

thx!

majidpics
12-16-2002, 01:33 AM
the problem is nothing except you are not calculating any buffer size and not setting any block size. your approach is right but your way of approach is confusing.

Spawn913
12-16-2002, 02:39 AM
One comment:


int a, b,c;

c=0;

while(c < 12088)
{
a = recv(sockdesc,buf,bufsize,0);
if (a <0) return(-1);
c +=a;
}


Also, like what majidpics said, what is buf and bufsize?

What does the length header field in an HTML packet indicate? does it indicate the length of the data or is it the length of the whole packet (including the headers)?
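
On the Content-Length question: in HTTP the Content-Length value counts only the body that follows the blank line, not the status line or the headers. A minimal sketch of pulling that value out of a header block, assuming the headers have already been received into a NUL-terminated string (parse_content_length is a made-up helper name, not something from this thread):

```c
#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Returns the Content-Length value found in a NUL-terminated header block,
   or -1 if the header is absent or has no usable number. */
long parse_content_length(const char *headers)
{
    static const char name[] = "Content-Length:";
    const size_t name_len = sizeof(name) - 1;
    const char *line = headers;

    while (line != NULL && *line != '\0') {
        /* Header names are case-insensitive, so compare without case. */
        if (strncasecmp(line, name, name_len) == 0) {
            char *end;
            long value = strtol(line + name_len, &end, 10);
            return (end != line + name_len && value >= 0) ? value : -1;
        }
        line = strchr(line, '\n');
        if (line != NULL)
            line++;  /* step past the newline to the next header line */
    }
    return -1;
}
```

The caller would then keep calling recv() until that many body bytes have arrived after the blank line, and fall back to reading until the server closes the connection when the function returns -1.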

monkeyboi
12-16-2002, 03:47 PM
buf is char array to hold the data...

bufsize is a #define value...i set bufsize to over 100kb, this bigger than any packet sent by the server....

goon12
12-16-2002, 03:52 PM
I had a similar problem but instead of

while (recv(sockdesc,buf,bufsize,0)>0)
{
strcat(totalbuf, buf);
}

I used

while (recv(sockdesc,buf,bufsize,0)>0)
{
strncat(totalbuf, buf, sizeof(buf));
}


Hope that helps,
goon12

monkeyboi
12-16-2002, 06:51 PM
the strncat and strcat both gives the same result....

anyway it workin but not all sites tho, fukken weird... about half of the sites i tried will work perfectly...

here part of the code where the problem lies...
<code>



/*
i got it up to this point where it will work fine half of the time....
give it a try yourself
gcc test.c -o test
to run
chmod 755 test
./test www.yahoo.com


monkeyboi
*/

#include <stdio.h>
#include <netdb.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>


#define bufsize 102400


int main(int argc, char *argv[])
{

struct sockaddr_in sin;
struct hostent *host;

int sockdesc, numbytes, i, a;
char buf[bufsize], buffer[999999], full_path[500];
sin.sin_family = AF_INET;
sin.sin_port = htons(80);



host = gethostbyname(argv[1]);

if (host)
{
memcpy((caddr_t)&sin.sin_addr.s_addr, host->h_addr, host->h_length);
}
else
{
perror("Resolving");
exit(-1);
}



strcpy(full_path, "GET / HTTP/1.0\n\n ");



sockdesc = socket(AF_INET, SOCK_STREAM, 0);


if (sockdesc < 0)
{
printf("socket error!");
return(-1);
}


if (connect(sockdesc, (struct sockaddr *)&sin, sizeof(sin)) < 0)
{
perror("connect");
close(sockdesc);
return(-1);

}


if(send(sockdesc, full_path, strlen(full_path), MSG_OOB) <0)
{

printf("Error sending packet\n");
return(-1);

}


numbytes = 0;
strcpy(buffer, "");
while ((a = recv(sockdesc, buf,bufsize,0)) >0)
{

strcat(buffer, buf);
numbytes +=a;

}

buffer[numbytes] = '\0';
//print out the html content
printf("\n%s\n__________________________________\n\n", buffer);
//total bytes received
printf("Total received length: %i bytes\n", numbytes);

close(sockdesc);

return 0;
}

</code>

monkeyboi
12-16-2002, 06:53 PM
damn it the code tag doesn't work, what is the tag to post code anyway???

Spawn913
12-16-2002, 10:49 PM
first off, code tags: " ... "

next, you're right. your code was weird. Sometimes it completes it, sometimes it doesnt.

Here's a patch (sorry, I can't think of anything else right now. :cool: ):

Why don't you keep on calling recv() until you get the "</html>" tag, instead of waiting for your recv() call to return 0?

We're not sure here who's at fault... is it your code, or does the http server you're connecting to stop sending data sometimes? *grin* just my wild guesses.

monkeyboi
12-17-2002, 12:24 AM
your patch could work but what if the site forgets to include the "</html>" tag? that would trap the program in an infinite loop..

i think is something wrong with my code cuz the real web browser seem to be fine with those webserver.....

Spawn913
12-17-2002, 02:56 AM
Change this part of your code:


while ((a = recv(sockdesc, buf,bufsize,0)) >0)
{
strcat(buffer, buf);
numbytes +=a;
}


to:


while ((a = recv(sockdesc, buf,bufsize,0)) >0)
{
strncat(buffer, buf, a);
numbytes +=a;
buffer[numbytes] = '\0';
}


there is a difference between strcat and strncat. strcat concatenates bytes until it reaches a null terminator ('\0') -- which may not be present in the data you are getting from recv(). strncat makes sure that only <a> bytes are concatenated. goon12 has a point, but the length part of his call to strncat() should have been the return value of the recv() call since recv() doesn't always get bufsize bytes.

goon12
12-17-2002, 10:54 AM
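Have you tried using read() instead of recv()? The return value of read() is the number of bytes read.


int z;

/* leave room for the terminator, and skip the append on error (-1) or EOF (0) */
z = read(sockdesc, buf, sizeof(buf) - 1);
if (z > 0) {
buf[z] = '\0';
strncat(buffer, buf, z);
}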




-goon12

bwkaz
12-17-2002, 11:26 AM
Yeah, I'd say use read() as well, instead of recv().

With read(), you can even go one character at a time until the socket gets closed (which happens when the transfer is done). Which would give you the most control...

Code tags are [ code ] and [ /code ], but without the spaces.

Spawn913
12-17-2002, 09:34 PM
... recv() also returns the number of bytes received. Although personally, I use read() as well.

:cool:

monkeyboi
12-17-2002, 10:16 PM
strncat() and read() still produce the same result as the strcat and recv().

anyway i pin pointed the problem....

it's the problem of strcat() function, when i added up the total bytes received from the webserver it's too small compared to the actual content length indicated by the server.... try www.buysell.com i can never get more than 2760 bytes from it when the actual content is 15006 bytes....

Spawn913
12-17-2002, 10:26 PM
the changes I mentioned are valid, and they work (at least in my tests). When I tried www.buysell.com, and after adding an error check:


....
<recv() loop>
...

if (a<0) {
perror("Recv");
}


I get a "Connection reset by peer" error. I guess what you need to add to your code is some error checking to ensure that your connection to the server is still up.

monkeyboi
12-17-2002, 10:29 PM
weird but not on my comp... max i get is 2760 bytes even with changes u made...

and i found out one more thing.. it only has that problem with site running IIS not apache....

Spawn913
12-17-2002, 10:33 PM
oh sorry about the confusion, the site didn't work also on my comp with the changes I made.

I just checked why it didn't (via checking the return value of recv() when it exited the loop), and thus came up with the "Connection reset by peer" error.

what you need to determine now is how to recover from such an error.

monkeyboi
12-17-2002, 10:40 PM
i never got that error before tho...

bastard23
12-17-2002, 11:15 PM
Try changing strcat to memcpy i.e.:

memcpy(buffer+numbytes,buf,a);

You shouldn't be depending on \0 bytes from the HTTP data to terminate your strings. Otherwise you are depending on the NULs already in memory to end your C string. After the loop, you should:
*(buffer+numbytes)='\0';
so you can use it as a regular C string.

As spawn mentioned, you should be checking "a" for -1 as well. Also, make sure you don't overrun your buffer: check that numbytes+a isn't bigger than what you allocated before the memcpy. There are tons of little gotchas here.
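
The bounds check described above could look something like this. It is only a sketch: append_bytes and its parameters are hypothetical names, and it assumes the caller tracks how many bytes are already stored.

```c
#include <stddef.h>
#include <string.h>

/* Appends n raw bytes to dst, which currently holds *used bytes out of cap.
   One byte is always kept free for a terminating NUL.
   Returns 0 on success, -1 if the chunk would not fit. */
int append_bytes(char *dst, size_t *used, size_t cap, const char *src, size_t n)
{
    if (cap == 0 || *used >= cap || n > cap - 1 - *used)
        return -1;
    memcpy(dst + *used, src, n);   /* copies exactly n bytes, NULs included */
    *used += n;
    dst[*used] = '\0';             /* keep the buffer usable as a C string */
    return 0;
}
```

In the receive loop this would replace the strcat/strncat call, for example append_bytes(buffer, &numbytes, sizeof buffer, buf, a), with numbytes declared as size_t.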

Good luck,
chris

PS www.linuxnewbie.org was a little short with your code.

monkeyboi
12-17-2002, 11:22 PM
well this updated code works well with 80% of the site.. i tried www.linuxnewbie.org and it receives the full content

/*
i got it up to this point where it will work fine half of the time....
give it a try yourself
gcc test.c -o test
to run
chmod 755 test
./test www.yahoo.com


monkeyboi
*/

#include <stdio.h>
#include <netdb.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>


#define bufsize 102400


int main(int argc, char *argv[])
{

struct sockaddr_in sin;
struct hostent *host;

int sockdesc, numbytes, i, a;
char buf[bufsize], buffer[999999], full_path[500];
sin.sin_family = AF_INET;
sin.sin_port = htons(80);



host = gethostbyname(argv[1]);

if (host)
{
memcpy((caddr_t)&sin.sin_addr.s_addr, host->h_addr, host->h_length);
}
else
{
perror("Resolving");
exit(-1);
}



strcpy(full_path, "GET / HTTP/1.0\n\n ");



sockdesc = socket(AF_INET, SOCK_STREAM, 0);


if (sockdesc < 0)
{
printf("socket error!");
return(-1);
}


if (connect(sockdesc, (struct sockaddr *)&sin, sizeof(sin)) < 0)
{
perror("connect");
close(sockdesc);
return(-1);

}


if(send(sockdesc, full_path, strlen(full_path), MSG_OOB) <0)
{

printf("Error sending packet\n");
return(-1);

}


numbytes = 0;
strcpy(buffer, "");


while ((a = recv(sockdesc, buf,bufsize,0)) >0)
{
if (a<0) perror("Recv");
strncat(buffer, buf, a);
numbytes +=a;
buffer[numbytes] = '\0';
printf("%i\n",a);

}

//buffer[numbytes] = '\0';
//print out the html content
printf("\n%s\n__________________________________\n\n", buffer);
//total bytes received
printf("Total received length: %i bytes\n", numbytes);

close(sockdesc);

return 0;
}

bastard23
12-17-2002, 11:29 PM
monkeyboi,

Put if (a<0) perror("Recv"); outside of the while loop. The while breaks if you get an -1 return value.

You might want to consider using memcpy, as what you are dealing with isn't really a C null terminated string. No reason to make it one until you want to make it one. But what ever works for you.

Good Luck,
chris

monkeyboi
12-17-2002, 11:32 PM
damn, i forgot i should put the perror outside the loop, otherwise it will never get executed when it's -1
heheh thx man

monkeyboi
12-17-2002, 11:34 PM
yap i got that error : Connection reset by peer..
this lead to a newest problem. fukkk

Spawn913
12-18-2002, 01:56 AM
memcpy() and strncat would do the same in this case, IMHO. since he placed the line:


while ((a = recv(sockdesc, buf,bufsize,0)) >0)
{
if (a<0) perror("Recv");
strncat(buffer, buf, a);
numbytes +=a;
buffer[numbytes] = '\0';
printf("%i\n",a);
}



anyway, i think your loop could be better if you did it something like this:


for ( ; ; ) {
a = recv(sockdesc, buf,bufsize,0);
if (a<0) {
/* do some error correction */
/* ........ */
continue;
}
strncat(buffer, buf, a);
numbytes +=a;
buffer[numbytes] = '\0';
printf("%i\n",a);
}


This way you don't have to check 'a' after the loop but inside it, and then be able to do something when an error occurs.

monkeyboi
12-18-2002, 02:28 AM
but then ur loop will never end..

anyway here is wut i concluded...
the problem exist only with sites that runs IIS webserver...
when the program connects to a site that uses IIS, the IIS server will send back 3 packets (each packet is around 1kb) then disconnect... if the content length is like 20kb it will require the server to send 10 packets or more back to the client, but IIS only sends 3 and disconnects....
even if i reconnect it still send the first 3 packet of the same content....

lastly y can't ppl jux use apache i mean it's like a billion times better than IIS and it's free..... damn bill gate..
hehehe got to take my anger on some1...
:)

peace!

Spawn913
12-18-2002, 04:20 AM
hmm... makes you wonder how the browsers get their html content, huh? :cool:

anyway, yes, the loop will never end in what i showed you. you'll have to fill in the

/* do some error correction */
/* ........ */

part to fix your code. If all you want is to break the loop, then just put a 'break' there and it will exit - similar to your original code.
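
One possible shape for that loop, as a sketch rather than a drop-in fix: recv() returning 0 means the server closed the connection (the normal end of an HTTP/1.0 response), a negative value is an error, and only EINTR is worth retrying. recv_until_close is a hypothetical name, and receiving straight into the output buffer is a simplification that is not in the original code.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Reads from fd until the peer closes the connection, an error occurs,
   or the output buffer is full. Stores the byte count in *total.
   Returns 0 on orderly close, -1 on error (errno set by recv),
   and 1 if the buffer filled up before the peer closed. */
int recv_until_close(int fd, char *out, size_t cap, size_t *total)
{
    *total = 0;
    for (;;) {
        if (*total == cap)
            return 1;
        ssize_t n = recv(fd, out + *total, cap - *total, 0);
        if (n == 0)
            return 0;               /* peer closed: end of response */
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            return -1;              /* e.g. ECONNRESET */
        }
        *total += (size_t)n;
    }
}
```

On a -1 return the caller can still use the *total bytes received so far and report the error with perror("recv").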

monkeyboi
12-18-2002, 04:46 AM
hehehe yah i wonder how to...:)


ok it's still either never end or ends with error and break the loop
[code]

for (;;)
{
a = recv(...);

if (a < 0)
{
/*when a <0 that means the connection is terminated (well not mostly but in this case it is)
so there is no point for doin anything in here, well only break out of the loop*/

break;
}
//do the real strcat stuff
}

bastard23
12-18-2002, 07:11 AM
Spawn913,
Yes, it basically is doing the same thing, but it is slightly more efficient to do the memcpy, since strncat has to rescan buffer and check for '\0' in buf. I also think it makes more sense for what you are doing. Not all http transfers are ASCII text. But if this is a limited domain program, it really isn't a problem. With memcpy, you may have to check buffer to see if it is really the whole "string" you want (strchr). Wouldn't strncat lose data if there were an extra NUL somewhere other than the end?

Again, whatever works.

Have fun,
chris

PS I'm up too late because I just watched the midnight show of The Two Towers. Awesome.

Spawn913
12-18-2002, 09:03 PM
PS I'm up too late because I just watched the midnight show of The Two Towers. Awesome.

Arg. I envy you. It's still not shown here.



Anyway... strncat() doesn't look for the ending '\0', but copies the number of bytes indicated in the function call, so having two NULLs shouldn't be a problem. Basically, it does the same thing as memcpy(), provided that the string to be appended to is NULL terminated.

bastard23
12-18-2002, 09:39 PM
Spawn913,
With strncat, if there is a NUL in the middle of buf, it won't copy the number of bytes you specify.
Here is my quick test code:

#include <string.h>
#include <stdio.h>

#define TESTSIZE 100
int main(){
char wnul[] = "Hello\0World";
char wonul[] = "Hello World";
char test[TESTSIZE];

memset(test,'\0',TESTSIZE);
strncat(test,wnul,TESTSIZE);
printf("%s\n",test);

strncat(test,wonul,strlen(wonul)+1);
printf("%s\n",test);

}

gcc test.c&&./a.out
output:
Hello
HelloHello World

So as long as you are only receiving text w/o nulls, you are OK. But this is something talking to the network, so you really should be careful. Also, I got weirdness if I didn't do the memset on test. I think that is because test will get allocated on the stack, but I didn't look into it.
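
The weirdness without memset is expected: strncat first looks for the terminating NUL of the destination, and an uninitialized array has indeterminate contents. For the embedded-NUL case, here is a small hedged sketch (bytes_before_nul is a made-up helper) that reports how much of a received chunk strncat would actually append, so it can be compared with memcpy, which always copies all n bytes:

```c
#include <string.h>

/* Returns how many bytes of an n-byte chunk strncat would append:
   it stops at the first NUL inside the chunk, if there is one. */
size_t bytes_before_nul(const char *chunk, size_t n)
{
    const char *nul = memchr(chunk, '\0', n);
    return nul ? (size_t)(nul - chunk) : n;
}
```

If the result is smaller than the recv() return value, the chunk contains binary data and strncat would silently drop the rest of it.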

Have fun,
chris

709394
12-19-2002, 09:05 PM
@monkeyboi
Do you know what is the block size that the host will send it back? If I change the size of the buffer[] in recv(), the number of bytes block received will change too. The total number of bytes received change each time that I execute the program. However, the file to be downloaded is the same each time.

I believe that the problem does not reside in "strncat" or string stuff. I think that you have to dictate the host how to send its packets and how much, if it is feasible. I would like to find it out too.
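
For what it's worth, TCP is a byte stream, so recv() hands back whatever has arrived so far, up to the buffer size you pass. The chunk sizes depend on the network and on your buffer, and as far as I know the client has no way to tell the server what sizes to send; only the total matters. A small sketch to see this locally (drain_in_chunks is a hypothetical helper, and the 64-byte cap is arbitrary):

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Drains fd using a caller-chosen chunk size (at most 64 bytes here).
   Returns the total bytes read and stores the number of successful
   recv() calls in *calls; returns -1 on error. */
long drain_in_chunks(int fd, size_t chunk, int *calls)
{
    char tmp[64];
    long total = 0;
    ssize_t n;

    if (chunk == 0 || chunk > sizeof tmp)
        chunk = sizeof tmp;
    *calls = 0;
    while ((n = recv(fd, tmp, chunk, 0)) > 0) {
        total += n;      /* the total is what matters ... */
        (*calls)++;      /* ... the number of pieces is incidental */
    }
    return n < 0 ? -1 : total;
}
```

Feeding it the same data with different chunk values (for instance over a socketpair()) changes the number of calls but not the total.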

Does somebody know where I can find info on this matter? All the sites I found that talk about socket programming do not cover the internet stuff, only private client-server, broadcast or UDP...

thanx!

monkeyboi
12-20-2002, 12:59 AM
yah there is nothing wrong with strncat but the problem is that the webserver (only IIS) disconnects from my client after it sends 3 packets... and some of the contents are bigger than 3 packets (each packet is around 1448 bytes). the latest code will work fine with apache and other webservers.... with which i will get the full contents of most sites....

i kno it's not IIS's fault cuz a webbrowser doesn't hav any problem getting the content from IIS.

i jux need to read more on socket programming and http protocol to find a solution to this problem....

thx for all yah help ppl :)

709394
01-06-2003, 01:49 AM
So, has anybody found the solution for the IIS server?
I searched through google and everywhere it says:
GET / HTTP/1.0\nhost:theHostName\n\n. However, I have to confess, I didn't go through the RFC 261X document. Too long!
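
HTTP line endings are CRLF (\r\n) rather than a bare \n, and servers that host several sites usually need the Host header to pick the right one. A sketch of building such a request (build_get_request is a hypothetical helper; whether this alone fixes the IIS case is not confirmed anywhere in this thread):

```c
#include <stdio.h>

/* Writes "GET <path> HTTP/1.0" plus a Host header into out,
   using CRLF line endings and a blank line to end the headers.
   Returns the request length, or -1 if it does not fit in cap bytes. */
int build_get_request(char *out, size_t cap, const char *host, const char *path)
{
    int n = snprintf(out, cap, "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n", path, host);
    if (n < 0 || (size_t)n >= cap)
        return -1;
    return n;
}
```

Separately, the listings earlier in the thread pass MSG_OOB to send(), which marks the data as TCP urgent (out-of-band) data. An ordinary request is sent with the flags argument set to 0. That may be worth trying against the IIS sites, though it is only a guess.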

However, I found an alternative. Use the libwww library.
http://www.w3.org/Library/

Very easy to use!!

Still, I want to know the solution!!