Handling BIG files in linux


Hena
09-10-2001, 02:34 AM
I have the following problem. I have a BIG text file and I'm supposed to reformat it quite heavily. Since the file is 3.1 GB (all ASCII text), I'm obviously not going to do this by hand, so I wrote a small program to do the reformatting. However, when I run it, it dumps core. I know my program works, since it has worked on smaller files. Any ideas what causes this and how to get around it? I'm using a 2.4.2 kernel.

berb
09-10-2001, 04:37 AM
what is free space on hard drive? also, it could have something to do with the /tmp limitations - more of a guru area.
what does the code do, ie does it write it to a temporary file or a new file and is there enough room there to do so.

also - what is the information from the core dump, is it from your proggy??

Hena
09-10-2001, 05:02 AM
what is free space on hard drive? also, it could have something to do with the /tmp limitations

I have one 33 GB hard drive with at least 29 GB free. It is mounted on a new mount point and I don't think it has any limitations (unless they are automatic).

what does the code do, ie does it write it to a temporary file or a new file and is there enough room there to do so.

It reads the file up to a certain point in the text and then writes the formatted version to the output file. It repeats this until EOF. I don't use any temp files in the process, and there was more than enough disk space. It crashes instantly, so it doesn't write anything to my output file.

also - what is the information from the core dump, is it from your proggy??

It is not in text format, so I don't know what's in it. It looked like a binary file when opened (lots of weird characters and whatnot). I know my program can handle at least a half-MB file without any problems.

Craig McPherson
09-10-2001, 05:43 AM
I can't give you a definite answer, but I know that getting large file support in Linux is a female dog. The general limit of 2GB has been with us forever. To get around it, the kernel, the filesystem, the C library, and sometimes the applications all have to properly support large files.

You'd be best off doing a Google search (http://www.google.com/search?q=Linux+large+file+support). I browsed through the first site that appears on the search, and it had some pretty good information that might get you started.

Good luck.

Malakin
09-10-2001, 05:46 AM
It's "gdb -c core program" to examine a core file (I think), where core is the core file and program is the name of the program. Then type "bt" to get a backtrace. Or you could always run your program under gdb and watch it crash.

A possible workaround, though: if this is a one-time thing, you could use "split" to break the text file into several pieces, run your program on each piece, and then join the results with "cat". It might take a little manual work at the seams, but it shouldn't take long, especially if it works with only two pieces.

[ 10 September 2001: Message edited by: Malakin ]

DMR
09-10-2001, 06:13 AM
Hena,

As Craig suggested, you're probably running into the 2G filesize limitation of glibc. Do a search at Google (http://www.google.com) using the keywords "glibc" and "2G" for more info.

Hena
09-10-2001, 06:48 AM
I have a 2.4.x kernel, so it should have LFS for the ext2 filesystem (which I'm using on that disk). My glibc is version 2.2.2-10, so that should support all the features in the kernel.

The third point, about my own programs, did have some effect, however. So for my own program to handle files over 2 GB, it needs to be compiled with the switches mentioned on this web page (http://www.suse.de/~aj/linux_lfs.html)? With those switches, instead of an empty output file I now get a 1.3 MB output file (from the 0.5 MB file I got 1.2 MB), with some errors in its output for some reason. The string is now filled properly. An off-topic question: in C, are strings 50000-90000 chars long allowed, and could they cause problems? :)
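For readers following along, here is a minimal sketch of what large file support looks like inside a C source file on 32-bit glibc. It assumes the `_FILE_OFFSET_BITS` feature macro described in the glibc documentation; `file_size` is a hypothetical helper invented for this illustration and is not taken from Hena's program.

```c
/* Must appear before any #include so that off_t, fopen, fseeko and
   ftello map to their 64-bit variants on 32-bit glibc systems. */
#define _FILE_OFFSET_BITS 64

#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: return the size of a file in bytes, or -1 on error. */
off_t file_size(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) {
        perror(path);           /* report why the open failed */
        return -1;
    }

    off_t size = -1;
    /* fseeko/ftello use off_t, so they can address offsets past 2 GB;
       plain fseek/ftell use long, which is 32 bits on i386. */
    if (fseeko(fp, 0, SEEK_END) == 0)
        size = ftello(fp);

    fclose(fp);
    return size;
}
```

The same effect can usually be had by passing `-D_FILE_OFFSET_BITS=64` on the compiler command line instead of editing the source. Either way, any variable that holds a file position should be an `off_t`, not an `int` or `long`.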

I don't seem to have a gdb command. Is this a program of some sort... Yes it is; I've got it now and will test it soon.

z911
09-10-2001, 07:29 AM
Hena,

I have some experience in handling very large files of the type you've been trying to process. Whether ASCII or binary, a quick and dirty way to do this without any fancy compiler switches and debugging is to read the file one line at a time and write the output one line at a time. This way you only have to declare one line's worth of variables, which needs the absolute minimum memory allocation.
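A minimal C sketch of this read-a-line, write-a-line pattern is below. `reformat_chunk` is a hypothetical stand-in (it just upper-cases the text) for whatever the real reformatting step does, and the 4096-byte buffer size is an arbitrary assumption. Note that streaming keeps memory use small but does not by itself lift the 2 GB limit; the program most likely still has to be built with large file support.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical stand-in for the real reformatting: upper-case in place. */
static void reformat_chunk(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)toupper((unsigned char)*s);
}

/* Copy `in` to `out` one line at a time, reformatting as we go.
   Returns 0 on success, -1 on a read or write error. */
int reformat_stream(FILE *in, FILE *out)
{
    char buf[4096];

    /* fgets never writes past the buffer; a line longer than the
       buffer simply arrives in several pieces. */
    while (fgets(buf, sizeof buf, in) != NULL) {
        reformat_chunk(buf);
        if (fputs(buf, out) == EOF)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

Because a long line can be split across several `fgets` calls, this shape only suits transformations that can work on partial lines. A transformation that needs a whole record at once would have to grow its buffer on the heap instead.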

I am not sure whether this will get around the 2Gb limit but I guess if you can edit your file under Linux (vi, gedit, kedit,...) then you should be able to take this shortcut. Any language should do (c, c++, FORTRAN, Java,...).

Good luck :)

Hena
09-10-2001, 07:51 AM
Got gdb and ran it.

GNU gdb 5.0
Copyright 2000 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
Core was generated by `./dna2aa2 ensembl-1.1.0_masked_golden_path ensembl-genome.fa2'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/i686/libc.so.6...done.
Loaded symbols for /lib/i686/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
#0 0x4009617f in _IO_fflush (fp=0xbffe98e0) at iofflush.c:44
44 iofflush.c: No such file or directory.


So does this mean that my program caused a seg fault? If so, then I think I'm going to have to try recoding it so that it uses a temporary file to read the large file in small pieces. Oh well, back to C...
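The trace above shows the crash happening inside fflush, with a stream pointer passed in by the program. The trace alone does not say why that pointer was bad, but stdio functions generally crash when handed something that is not a valid open stream, for example after an fopen that failed and was never checked. A minimal sketch of defensive opening and closing follows; `open_checked` and `close_checked` are hypothetical helper names made up for this example.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open a file and say why if it cannot be opened. */
FILE *open_checked(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);
    if (fp == NULL)
        fprintf(stderr, "cannot open %s: %s\n", path, strerror(errno));
    return fp;
}

/* Hypothetical helper: close only a stream that was really opened.
   fclose flushes pending output first. Returns 0 on success, -1 otherwise. */
int close_checked(FILE *fp)
{
    if (fp == NULL)
        return -1;
    return fclose(fp) == 0 ? 0 : -1;
}
```

With checks like these, a failed open on a too-large file would show up as an error message naming the file rather than as a core dump.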