2010-12-19

Introducing Data::Handle

Coming soon to a mirror near you: Data::Handle.

What does Data::Handle do?

Data::Handle solves 2 very simple problems that occur with the __DATA__ section and the associated *DATA glob. Both of them come down to multiple pieces of code trying to access the section.

1. Provide a reliable way to get a file-handle with the position at the start of the __DATA__ segment

  1. *DATA is really a pointer to the entire file, and not just the data segment
  2. The Perl interpreter sets the current position in the file to be after the __DATA__ line

The first time you read from *DATA this of course works fine, but reading moves the internal file cursor, and once you have read the whole section the cursor points to EOF. For a second block of code to re-read this data without communicating with the first block of code, it has to rewind the file cursor back to the start of the section before reading, and there is no natural way to know where that point is.

Other modules so far have remedied this by trying to rewind to the start of the file, and manually emulate various parts of the Perl Parser to re-find the start of the __DATA__ section before re-reading its contents.

This module, however, takes a different approach: it assumes that the first person to read that file handle will know what they're doing, and will use this module to do it. This module then records the file offset the __DATA__ section began at, so from that point onwards, rewinding to the start is a trivial exercise.

And all this happens for you simply by writing:

my $handle = Data::Handle->new( __PACKAGE__ ); 

instead of doing

my $handle = do { no strict 'refs'; \*{ __PACKAGE__ . "::DATA"} };
(Note: as a side perk, the new syntax is simpler, more straightforward, easier to remember, and no dicking around with strict! ;D)
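
To make the offset-recording idea concrete, here is a minimal sketch. It is not Data::Handle's actual code: it uses an in-memory file as a stand-in for a source file and its *DATA handle, and the variable names ($source, $data_start and so on) are made up for illustration. It only shows the general technique of calling tell before the first read and seek afterwards.

```perl
use strict;
use warnings;

# Stand-in for a source file: some code, then a data section.
# (Assumption: an in-memory file behaves like *DATA for seek/tell purposes.)
my $source = "some_code();\n__DATA__\nfirst\nsecond\n";
open my $fh, '<', \$source or die "cannot open in-memory file: $!";

# Emulate what the interpreter does: leave the cursor just after __DATA__.
while ( my $line = <$fh> ) {
    last if $line =~ /^__DATA__\s*$/;
}

# The first reader records where the section starts before consuming it.
my $data_start = tell $fh;

my @first_pass = <$fh>;    # the cursor is now at EOF

# A later reader rewinds to the recorded offset instead of re-parsing the file.
seek $fh, $data_start, 0 or die "cannot seek: $!";
my @second_pass = <$fh>;
```

Both @first_pass and @second_pass end up holding the same two lines, because the second read starts from the recorded offset rather than from EOF or from the top of the file.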

2. Provide a reliable way for 2 separate logical code units to access the same __DATA__ segment without interfering with each other

Because *DATA is a filehandle, and there is only one of them, seeking around in it can be problematic.

This is especially true if you have 2 code units that are trying to read it from different places. For a contrived example: prior to this module, if you wanted to go back and re-read the start of the section, or skip forwards and read something later in the section, without forgetting where you are now, you'd need an awkward dance of seek/tell. Now you can just create another worker that will read that stuff for you, and the original handle will retain its position.

my $handle = Data::Handle->new( 'Foo' );
while( <$handle> ){ 

   if ( $_ =~ /something/ ){ 
       # get line 1. 
       my $slave = Data::Handle->new('Foo');
       my $firstline = <$slave>;
       do_stuff_with_first_line($firstline);
   }
   
   # continue as normal.
}

Internally, there is a lovely dance of seek() going on there, but from an interface perspective, you don't need to know it's seeking. All you need to know is "Get reference to DATA, get data from it".

Sure, you can probably argue you could do it easily with lots of seek() in a nice way, but that logic falls apart when you have code in 2 separate places reading the same *DATA.

It's much smarter to be defensive about it, and have some assurance that you can read a file descriptor in a safe way without something evil like this tampering with it.


my $handle = do { no strict 'refs'; \*{ __PACKAGE__ . "::DATA"} };
while ( <$handle> ) {    # read via the handle; <*DATA> would be parsed as a file glob
   do_something_with_($_);
   evil_function();   
}

....
sub evil_function { 
  my $handle = do { no strict 'refs'; \*{ __PACKAGE__ . "::DATA"} };
  seek $handle, 0, 2; # seek to EOF (whence 2 is SEEK_END).
}

That is spooky action at a distance!

Data::Handle solves this by meticulously tracking position in each instance, and re-seeking the file handle to the place it was at the end of the last tracked read, so regardless of how much seeking around some other module did, as long as you got on the scene first, you should be unstoppable ;)
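
As a rough illustration of that per-instance tracking, here is a small sketch. It is not the module's real implementation: the PositionTracker class and its getline method are hypothetical names invented for this example, and an in-memory file stands in for the shared *DATA handle. Each reader remembers its own offset and re-seeks the shared handle before every read.

```perl
use strict;
use warnings;

package PositionTracker;    # hypothetical name, not part of Data::Handle

sub new {
    my ( $class, $fh, $start ) = @_;
    return bless { fh => $fh, pos => $start }, $class;
}

sub getline {
    my ($self) = @_;

    # Restore this reader's own position, whatever anyone else did meanwhile.
    seek( $self->{fh}, $self->{pos}, 0 ) or return;
    my $line = readline( $self->{fh} );

    # Remember where this reader stopped, for its next call.
    $self->{pos} = tell( $self->{fh} );
    return $line;
}

package main;

# Stand-in for a shared *DATA handle (assumption: in-memory file).
my $content = "one\ntwo\nthree\n";
open my $shared, '<', \$content or die "cannot open in-memory file: $!";

my $reader_a = PositionTracker->new( $shared, 0 );
my $reader_b = PositionTracker->new( $shared, 0 );

my $a_first = $reader_a->getline;    # reader A reads its first line
seek $shared, 0, 2;                  # an "evil" seek to EOF on the shared handle
my $a_second = $reader_a->getline;   # reader A carries on from its own offset
my $b_first  = $reader_b->getline;   # reader B starts from the beginning
my $a_third  = $reader_a->getline;   # reader A is unaffected by reader B
```

The trade-off of this design is one extra seek per read, in exchange for not having to trust any other code that touches the same underlying handle.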

6 comments:

  1. I haven't read your code in full yet, but wouldn't this be simpler:

    package Dh;

    use strict;
    use warnings;

    use Carp;

    our %offsets;
    sub new {
        my ($class, $package) = @_;
        unless (defined $package) {
            $package = "main"; #FIXME: maybe this should use caller
        }
        my $dh_name = "${package}::DATA";
        my $orig_dh = do { no strict 'refs'; \*{$dh_name} };
        open my $dh, "<&", $orig_dh
            or croak "could not dup $dh_name: $!";
        if (exists $offsets{$dh_name}) {
            seek $dh, $offsets{$dh_name}, 0;
        } else {
            $offsets{$dh_name} = tell $orig_dh;
        }

        return $dh;
    }

    1;

  2. You can fdopen (or dup) DATA like this, instead of that ugly do hack:

    open my $fh, '<&=', "${package}::DATA"; # fdopen
    open my $fh, '<&', "${package}::DATA"; # dup

  3. Or using IO::Handle's new_from_fd:

    my $fh = IO::Handle->new_from_fd( "${package}::DATA", "r");


    Also, related to your module's _is_valid_data_tell function: __DATA__ can be preceded and followed by whitespace (so length "__DATA__\n" is not a valid length to check for), and it can also be spelled __END__ in the case of package main.

  4. @Chas Owens , @Jamesw, I've tried your techniques ( the open variant at least ) listed, but they do not appear to work as you suggest.

    Both "<&" and "<&=" result in a file handle in which the underlying backing mechanism is still very tightly coupled.

    I've added tests to my test-suite for the behaviours you suggest but I cannot get them working as you suggest they do.

    https://github.com/kentfredric/Data-Handle/blob/master/t/03_fdup_test.t

  5. I'd gladly use a simpler technique if it were possible, but my exploration so far has not found me one :(

  6. Unfortunately fdopening (and duping) DATA in versions of perl before 5.10 actually sets the file position to the end of the file.
