The problem of opening large files
Arnaud Taddei
Arnaud.Taddei at sun.com
Sun Mar 24 22:01:55 CET 2002
Recently I showed some customers what kind of information they could get from
an LDAP log using my own scripts (not yet in Lire), in order to make sure that
my ideas are sound before we introduce an LDAP superservice.
I ran into a problem: my program had to read a 2GB log file covering 3 days of
LDAP logs, and it failed with
Cannot open file: Value too large for defined data type at
/homedir/a/admin/project/bin/testopen line 6, <IN> chunk 1.
The code was simply:
open(F, $file) || die "Cannot read $file: $!";
so I was stuck: I have no infrastructure to split the input file, and I was
pressed for time as I had to provide results. After some investigation, it
looks like the message is the system's EOVERFLOW error, which open returns when
the size of the file does not fit in the file offset type the program was built
with. That would mean a perl built without large file support will never be
able to open such a file directly, whatever the script does. So how can
programs like less or cat open a file that large? Presumably because they are
built with large file support (some implementations also read through mmap). As
I couldn't find a simple way to do that in perl, I got the idea that reading
from a pipe might work better, since a pipe has no file size to check, and
indeed it worked:
open(F, "cat $file |") || die "Cannot read from cat $file pipe: $!";
This worked fine (you may or may not want to set ' $| = 1; ' before running
it; note that it only affects output buffering, not the read itself).
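As a minimal sketch of the pipe trick above (not code from Lire), the helper
below checks whether perl reports large file support and only falls back to
the cat pipe when it does not. The sub names has_largefile_support,
open_via_cat and open_log are made up for illustration, and the list form of
the pipe open assumes perl 5.8 or later and a cat command in the PATH.

```perl
use strict;
use warnings;
use Config;

# True when this perl was built with large file support.
# $Config{uselargefiles} is 'define' on such builds.
sub has_largefile_support {
    return defined $Config{uselargefiles}
        && $Config{uselargefiles} eq 'define';
}

# Read $file through a 'cat' pipe. The list form of open does not go
# through the shell, so odd characters in the name are not interpreted.
sub open_via_cat {
    my ($file) = @_;
    open(my $fh, '-|', 'cat', $file)
        or die "Cannot read from cat $file pipe: $!";
    return $fh;
}

# Open $file directly when perl can handle large files, else via the pipe.
sub open_log {
    my ($file) = @_;
    return open_via_cat($file) unless has_largefile_support();
    open(my $fh, '<', $file) or die "Cannot read $file: $!";
    return $fh;
}
```

With the pipe, a missing or unreadable file is only reported by cat itself, so
it is worth checking the return value of close on the handle as well.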
This led me to think that:
1) You might face the same problem in Lire, so you may want to consider an
alternative way of opening large files rather than using open in the classical
way.
2) There is another issue, which is:
"What if we want to read multiple log files which together form one
continuous log covering several months?"
Well, if we want to append all of these files together, we now know that we
could use a trick like the one above to open the result, BUT we will face
plenty of other limitations. Examples:
- in-memory tables could explode
- the sort command (that you guys are using) might fail for other reasons, such
as lack of space
- etc.
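One way to limit the memory side of this, sketched here under my own
assumptions rather than taken from Lire, is to never build the concatenated
file at all and to walk the files in order, one line at a time. The sub name
each_log_line and the callback interface are hypothetical.

```perl
use strict;
use warnings;

# Call $callback->($line, $file) for every line of every file, in the
# order given. Only one line is held in memory at a time.
# Returns the total number of lines seen.
sub each_log_line {
    my ($files, $callback) = @_;
    my $count = 0;
    for my $file (@$files) {
        open(my $fh, '<', $file) or die "Cannot read $file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            $callback->($line, $file);
            $count++;
        }
        close $fh;
    }
    return $count;
}
```

This only helps if the callback itself keeps bounded state (counters, or
sessions that are dropped once they are closed); it does nothing for a later
sort step.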
On the other hand, we have a generic problem of log file 'continuation'. When I
start reading a log file, several of the lines I need are missing because they
are in the previous log. Examples:
- the sendmail or NMS accept line that shows the originator, client relay,
etc.
- in LDAP: a session might have started weeks before the log we are currently
looking at, because of LDAP connection pools.
- in IMAP/POP: a session might have started days before, and the line
containing the IP address of the client machine is in an earlier log.
All of this leads me to think that:
1) We have orphan lines 'at the beginning' of a log file and unfinished sessions
'at the end' of a log file.
2) We should have the concept of a persistent continuation representation that
keeps the orphan and unfinished lines, so that we can prepend or append them to
the log file we are looking at. This could take the form of a key:line pair
file that the current 'log2dlf' program could read in order to reconcile any
line that looks unfinished or orphaned.
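To make the key:line idea a bit more concrete, here is a rough sketch. It is
not part of log2dlf, and everything in it is an assumption for illustration:
the sub names load_pending, save_pending and reconcile, the use of a
'conn=N' field as session key, and the word 'closed' as the end-of-session
marker. Keys must not contain a colon.

```perl
use strict;
use warnings;

# Load "key:line" pairs saved by a previous run; returns { key => [lines] }.
sub load_pending {
    my ($path) = @_;
    my %pending;
    return \%pending unless -e $path;
    open(my $fh, '<', $path) or die "Cannot read $path: $!";
    while (my $rec = <$fh>) {
        chomp $rec;
        my ($key, $line) = split /:/, $rec, 2;   # split on the first colon only
        next unless defined $line;
        push @{ $pending{$key} }, $line;
    }
    close $fh;
    return \%pending;
}

# Save the lines of sessions that are still open as "key:line" pairs.
sub save_pending {
    my ($path, $pending) = @_;
    open(my $fh, '>', $path) or die "Cannot write $path: $!";
    for my $key (sort keys %$pending) {
        print $fh "$key:$_\n" for @{ $pending->{$key} };
    }
    close $fh or die "Cannot close $path: $!";
}

# Group the lines of one log file by session, starting from the sessions
# left open by the previous file.
# Returns (arrayref of complete sessions, hashref of sessions still open).
sub reconcile {
    my ($lines, $pending) = @_;
    my %open;
    $open{$_} = [ @{ $pending->{$_} } ] for keys %$pending;
    my @complete;
    for my $line (@$lines) {
        my ($key) = $line =~ /\bconn=(\d+)/ or next;  # hypothetical session key
        push @{ $open{$key} }, $line;
        if ($line =~ /\bclosed\b/) {                   # hypothetical end marker
            push @complete, delete $open{$key};
        }
    }
    return (\@complete, \%open);
}
```

The intended use would be: load_pending before reading a log file, reconcile
while reading it, then save_pending with whatever is still open, so the next
file can pick those sessions up. Lines whose session start was never seen
simply begin a new entry, which is one possible way to flag them as orphans.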
Let me know what you think about these ideas. I think we can do a more accurate
job than what we currently do. I can tell you that I have access to a very
messy set of log files from 6 months ago, and such a tool would make it
considerably easier to minimise the number of imperfections of current logging
systems.
A++