2010-06-25

Some basic statistics on "Line Noise"

I was reading another blog post by somebody intending to analyse how much Perl code constitutes "line noise", but they didn't appear to have Actually Done It.

I took a naïve approach and made no assumptions about what constitutes "line noise"; I just ran basic statistics on the prevalence of various characters, for the sake of interest.

Partial Dump
  0.2 % :   511319 x char   64 : "\@"
  0.2 % :   564540 x char   55 : 7
  0.2 % :   593117 x char   79 : "O"
  0.2 % :   601710 x char   77 : "M"
  0.3 % :   675072 x char   92 : "\\"
  0.3 % :   684986 x char   68 : "D"
  0.3 % :   698665 x char   78 : "N"
  0.3 % :   709768 x char   76 : "L"
  0.3 % :   712074 x char   80 : "P"
  0.3 % :   763426 x char   56 : 8
  0.3 % :   784577 x char  107 : "k"
  0.3 % :   797560 x char   82 : "R"
  0.3 % :   833723 x char   54 : 6
  0.4 % :   912737 x char   52 : 4
  0.4 % :   920716 x char   93 : "]"
  0.4 % :   921001 x char   91 : "["
  0.4 % :   924075 x char   73 : "I"
  0.4 % :   947539 x char  118 : "v"
  0.4 % :   956653 x char   67 : "C"
  0.4 % :   996323 x char   65 : "A"
  0.4 % :  1000637 x char   83 : "S"
  0.5 % :  1125435 x char  119 : "w"
  0.5 % :  1151874 x char   46 : "."
  0.5 % :  1220735 x char   34 : "\""
  0.5 % :  1222341 x char    9 : "\t"
  0.5 % :  1222927 x char   51 : 3
  0.5 % :  1241600 x char   69 : "E"
  0.5 % :  1243448 x char   53 : 5
  0.5 % :  1332828 x char   84 : "T"
  0.6 % :  1443662 x char   57 : 9
  0.6 % :  1491434 x char  120 : "x"
  0.6 % :  1499376 x char  125 : "}"
  0.6 % :  1500792 x char  123 : "{"
  0.7 % :  1718028 x char  103 : "g"
  0.7 % :  1739054 x char   40 : "("
  0.7 % :  1739695 x char   41 : ")"
  0.7 % :  1792258 x char   59 : ";"
  0.7 % :  1825133 x char  121 : "y"
  0.8 % :  1837291 x char   98 : "b"
  0.8 % :  1842316 x char   35 : "#"
  0.8 % :  1960600 x char   50 : 2
  0.9 % :  2149806 x char   62 : ">"
  1.0 % :  2410416 x char   49 : 1
  1.1 % :  2594921 x char   61 : "="
  1.1 % :  2684166 x char   95 : "_"
  1.1 % :  2709633 x char  112 : "p"
  1.2 % :  2818643 x char   58 : ":"
  1.2 % :  2952175 x char  104 : "h"
  1.2 % :  2995621 x char   45 : "-"
  1.3 % :  3151943 x char  109 : "m"
  1.3 % :  3283418 x char   36 : "\$"
  1.3 % :  3291138 x char  102 : "f"
  1.4 % :  3339529 x char   39 : "'"
  1.4 % :  3355931 x char  117 : "u"
  1.5 % :  3638254 x char   99 : "c"
  1.6 % :  4016055 x char  100 : "d"
  1.9 % :  4598003 x char   44 : ","
  2.0 % :  4786703 x char  108 : "l"
  2.2 % :  5472272 x char   48 : 0
  2.6 % :  6279579 x char  110 : "n"
  2.6 % :  6306811 x char  111 : "o"
  2.7 % :  6625715 x char  105 : "i"
  2.8 % :  6872608 x char  114 : "r"
  3.0 % :  7315145 x char  115 : "s"
  3.1 % :  7522087 x char   97 : "a"
  3.6 % :  8711403 x char   10 : "\n"
  3.7 % :  8972142 x char  116 : "t"
  5.4 % : 13289205 x char  101 : "e"
 24.2 % : 59186425 x char   32 : " "

I find it quite intriguing how the various bracketings are unbalanced. Also, the significantly greater use of ">" vs "<" indicates people write more than they read. Edit: probably more "=>".
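One way to test the "=>" guess would be to tally how each ">" is actually used. The following is only a rough sketch, not part of the script that produced the numbers above: the `classify_gt` helper and its three buckets are made up for illustration, and it does not try to tell real code apart from strings, comments or POD.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: classify every ">" in a chunk of source text as
# part of a fat comma "=>", part of an arrow "->", or anything else
# (comparisons, "<=>", ">>", heredoc text, and so on).
sub classify_gt {
    my ($source) = @_;

    # The "= () =" idiom counts matches instead of returning them.
    my $total     = () = $source =~ />/g;
    my $fat_comma = () = $source =~ /(?<!<)=>/g;    # skip the "=>" inside "<=>"
    my $arrow     = () = $source =~ /->/g;

    return {
        fat_comma => $fat_comma,
        arrow     => $arrow,
        other     => $total - $fat_comma - $arrow,
    };
}

my $sample = <<'END';
my %h = ( a => 1 );
$obj->method if $x > 2;
my @s = sort { $a <=> $b } @l;
END

my $uses = classify_gt($sample);
printf "%-9s : %d\n", $_, $uses->{$_} for sort keys %$uses;
```

On the small sample above this reports one fat comma, one arrow and two other uses. Feeding it the contents of each scanned file would show how much of the ">" surplus comes from "=>" and "->" rather than from output redirection.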

Also, what is extremely amusing is that in this sort order, ignoring "r", "a", "t" and all whitespace, a word is formed going down. That word... is "noise". Weird.

For a full dump of my diagnostics, see my GitHub gist.

The code I used to generate these stats is pretty straightforward, and I would be interested in seeing what sort of results other people get, and possibly the result of adapting the code to C and other non-Perl languages to work out how much "line noise" they contain.

#!/usr/bin/perl
use strict;
use warnings;

use 5.12.1;
use File::Find::Rule            ();
use File::Find::Rule::Perl      ();
use Data::Dumper                qw( Dumper );

say $_ for ( @INC );

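# Find every Perl file under the @INC directories. Nested @INC entries
# may cause some files to be listed, and so counted, more than once.
my @pmfiles = File::Find::Rule->perl_file->in( @INC );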

my %stats;

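for my $file ( @pmfiles ){
    say "scanning $file";
    open my $fh, '<', $file or next;    # skip files that cannot be opened
    my $char;
    # Read one byte at a time and tally it.
    while( read $fh, $char, 1 ){
        $stats{$char}++;
    }
#    last;    # uncomment to stop after the first file when testing
}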

# Pair each character with its count, sorted from least to most frequent.
my @data = sort { $a->[0] <=> $b->[0] } map { [ $stats{$_} , $_ ] } keys %stats;

$Data::Dumper::Terse = 1;
$Data::Dumper::Useqq = 1;

my $numchars;
$numchars += $_ for values %stats;

# Dumper (in Terse mode) ends its output with a newline, so the format needs none.
for( @data ){
    printf "%5.1f %% : %8d x char %4d : %s" ,
       ( $_->[0] / $numchars * 100 ) ,
       $_->[0] ,
       ord( $_->[1] ),
       Dumper( $_->[1] );
}
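For anyone wanting to try the same thing on C, here is a rough sketch of one possible adaptation. It is an assumption-laden variant, not the script that produced the numbers above: the `find_files` and `count_chars` names are made up, the `\.[ch]` filename pattern is just a guess at what counts as "C source", and it uses the core File::Find module rather than File::Find::Rule so that it runs without extra installs.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find ();

# Hypothetical helper: collect plain files under @dirs whose basename
# matches $pattern.
sub find_files {
    my ( $pattern, @dirs ) = @_;
    my @found;
    File::Find::find(
        sub { push @found, $File::Find::name if -f $_ && $_ =~ $pattern },
        @dirs,
    );
    return @found;
}

# Hypothetical helper: tally every byte in the given files. Each file is
# slurped whole instead of being read one byte at a time.
sub count_chars {
    my (@files) = @_;
    my %stats;
    for my $file (@files) {
        open my $fh, '<:raw', $file or next;    # skip unreadable files
        local $/;                               # slurp mode
        my $content = <$fh>;
        next unless defined $content;
        $stats{$_}++ for split //, $content;
    }
    return \%stats;
}

# Directories to scan come from the command line, defaulting to ".".
my @dirs  = @ARGV ? @ARGV : ('.');
my @files = find_files( qr/\.[ch]\z/, @dirs );    # assumed pattern for C files
my $stats = count_chars(@files);

my $total = 0;
$total += $_ for values %$stats;

# Same ordering as the dump above: least frequent first.
for my $char ( sort { $stats->{$a} <=> $stats->{$b} } keys %$stats ) {
    printf "%5.1f %% : %8d x char %4d\n",
        $stats->{$char} / $total * 100, $stats->{$char}, ord $char;
}
```

Run it as `perl charstats.pl /path/to/some/c/project` (the script name is arbitrary). Swapping the pattern for something like `qr/\.py\z/` would point it at another language.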

1 comment:

  1. ETASRIONDCUF is reasonably close to ETAOINSHRDLU :)
