I was reading another blog post by somebody intending to analyse how much Perl code counts as "line noise", but they didn't appear to have Actually Done It.
I took a naïve approach and didn't make any assumptions about what constitutes "line noise"; I just did basic statistics on the prevalence of various characters, for the sake of interest.
  0.2 % :   511319 x char   64 : "\@"
  0.2 % :   564540 x char   55 : 7
  0.2 % :   593117 x char   79 : "O"
  0.2 % :   601710 x char   77 : "M"
  0.3 % :   675072 x char   92 : "\\"
  0.3 % :   684986 x char   68 : "D"
  0.3 % :   698665 x char   78 : "N"
  0.3 % :   709768 x char   76 : "L"
  0.3 % :   712074 x char   80 : "P"
  0.3 % :   763426 x char   56 : 8
  0.3 % :   784577 x char  107 : "k"
  0.3 % :   797560 x char   82 : "R"
  0.3 % :   833723 x char   54 : 6
  0.4 % :   912737 x char   52 : 4
  0.4 % :   920716 x char   93 : "]"
  0.4 % :   921001 x char   91 : "["
  0.4 % :   924075 x char   73 : "I"
  0.4 % :   947539 x char  118 : "v"
  0.4 % :   956653 x char   67 : "C"
  0.4 % :   996323 x char   65 : "A"
  0.4 % :  1000637 x char   83 : "S"
  0.5 % :  1125435 x char  119 : "w"
  0.5 % :  1151874 x char   46 : "."
  0.5 % :  1220735 x char   34 : "\""
  0.5 % :  1222341 x char    9 : "\t"
  0.5 % :  1222927 x char   51 : 3
  0.5 % :  1241600 x char   69 : "E"
  0.5 % :  1243448 x char   53 : 5
  0.5 % :  1332828 x char   84 : "T"
  0.6 % :  1443662 x char   57 : 9
  0.6 % :  1491434 x char  120 : "x"
  0.6 % :  1499376 x char  125 : "}"
  0.6 % :  1500792 x char  123 : "{"
  0.7 % :  1718028 x char  103 : "g"
  0.7 % :  1739054 x char   40 : "("
  0.7 % :  1739695 x char   41 : ")"
  0.7 % :  1792258 x char   59 : ";"
  0.7 % :  1825133 x char  121 : "y"
  0.8 % :  1837291 x char   98 : "b"
  0.8 % :  1842316 x char   35 : "#"
  0.8 % :  1960600 x char   50 : 2
  0.9 % :  2149806 x char   62 : ">"
  1.0 % :  2410416 x char   49 : 1
  1.1 % :  2594921 x char   61 : "="
  1.1 % :  2684166 x char   95 : "_"
  1.1 % :  2709633 x char  112 : "p"
  1.2 % :  2818643 x char   58 : ":"
  1.2 % :  2952175 x char  104 : "h"
  1.2 % :  2995621 x char   45 : "-"
  1.3 % :  3151943 x char  109 : "m"
  1.3 % :  3283418 x char   36 : "\$"
  1.3 % :  3291138 x char  102 : "f"
  1.4 % :  3339529 x char   39 : "'"
  1.4 % :  3355931 x char  117 : "u"
  1.5 % :  3638254 x char   99 : "c"
  1.6 % :  4016055 x char  100 : "d"
  1.9 % :  4598003 x char   44 : ","
  2.0 % :  4786703 x char  108 : "l"
  2.2 % :  5472272 x char   48 : 0
  2.6 % :  6279579 x char  110 : "n"
  2.6 % :  6306811 x char  111 : "o"
  2.7 % :  6625715 x char  105 : "i"
  2.8 % :  6872608 x char  114 : "r"
  3.0 % :  7315145 x char  115 : "s"
  3.1 % :  7522087 x char   97 : "a"
  3.6 % :  8711403 x char   10 : "\n"
  3.7 % :  8972142 x char  116 : "t"
  5.4 % : 13289205 x char  101 : "e"
 24.2 % : 59186425 x char   32 : " "
I find it quite intriguing how the various bracketings are unbalanced. Also, the significantly greater use of ">" vs "<" indicates people write more than they read. Edit: probably more "=>".
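To put numbers on the bracket imbalance, here is a minimal sketch that subtracts the closer count from the opener count for each pair. The `%count` hash and the `imbalance` helper are my own illustration, not part of the original script; the figures are copied by hand from the table above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Counts copied by hand from the table above (illustrative only).
my %count = (
    '(' => 1739054, ')' => 1739695,
    '[' => 921001,  ']' => 920716,
    '{' => 1500792, '}' => 1499376,
);

# Hypothetical helper: opener count minus closer count for one pair.
sub imbalance {
    my ( $open, $close ) = @_;
    return $count{$open} - $count{$close};
}

for my $pair ( [ '(', ')' ], [ '[', ']' ], [ '{', '}' ] ) {
    my ( $open, $close ) = @$pair;
    # %+d prints an explicit sign, so surplus openers show as "+".
    printf "%s%s : %+d\n", $open, $close, imbalance( $open, $close );
}
```

On these figures there are 641 more ")" than "(", 285 more "[" than "]", and 1416 more "{" than "}". Since the script counts every character, brackets inside strings, regexes, comments and POD are all included, which is a plausible source of the mismatch.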
Also, what is extremely amusing is that in this sort order, ignoring "r", "a" and "t" and all whitespace, going down, a word is formed. That word... is "noise". Weird.
For a full dump of my diagnostics, see my GitHub gist.
The code I used to generate these stats is pretty straightforward, and I would be interested in seeing what sort of results other people get, and possibly the result of adapting the code to work for C and other non-Perl languages to work out how much "line noise" they are.
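For anyone wanting to try this on C or other sources, here is a rough sketch that takes file names on the command line instead of walking @INC. It rests on an assumption the stats above deliberately avoid: that "line noise" means any character that is neither whitespace nor a word character (letter, digit or underscore). The `classify` and `tally` helpers and the `noise.pl` file name are hypothetical, made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# ASSUMPTION: "line noise" is anything that is neither whitespace
# nor a word character (letter, digit, underscore).
sub classify {
    my ($char) = @_;
    return 'space' if $char =~ /\s/;
    return 'word'  if $char =~ /\w/;
    return 'noise';
}

# Tally the three classes over one string; returns a hash reference.
sub tally {
    my ($text) = @_;
    my %class = ( space => 0, word => 0, noise => 0 );
    $class{ classify($_) }++ for split //, $text;
    return \%class;
}

# Usage (hypothetical file name): perl noise.pl foo.c bar.c
my %total = ( space => 0, word => 0, noise => 0 );
for my $file (@ARGV) {
    open my $fh, '<', $file or next;
    local $/;    # slurp the whole file in one read
    my $text = <$fh>;
    next unless defined $text;
    my $part = tally($text);
    $total{$_} += $part->{$_} for keys %$part;
}

my $sum = $total{space} + $total{word} + $total{noise};
if ($sum) {
    printf "%5.1f %% %s\n", $total{$_} / $sum * 100, $_ for sort keys %total;
}
```

A different definition of "noise" only needs a different `classify`; the rest stays the same.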
#!/usr/bin/perl
use strict;
use warnings;
use 5.12.1;

use File::Find::Rule       ();
use File::Find::Rule::Perl ();
use Data::Dumper qw( Dumper );

say $_ for ( @INC );

# Collect every Perl file found under the directories in @INC.
my @pmfiles = File::Find::Rule->perl_file->in( @INC );
my %stats;
for my $file ( @pmfiles ){
    say "scanning $file";
    open my $fh, '<', $file or next;
    my $char;
    # Read one character at a time and count it.
    while( read $fh, $char, 1 ){
        $stats{$char}++;
    }
    # last;
}

# Pairs of [ count, char ], sorted ascending by count.
my @data = sort { $a->[0] <=> $b->[0] } map { [ $stats{$_} , $_ ] } keys %stats;

# Terse + Useqq make Dumper print each character as a bare, escaped literal.
$Data::Dumper::Terse = 1;
$Data::Dumper::Useqq = 1;

my $numchars;
$numchars += $_ for values %stats;

# No "\n" in the format: Dumper's output already ends with a newline.
for( @data ){
    printf "%5.1f %% : %8d x char %4d : %s" ,
        ( $_->[0] / $numchars * 100 ) ,
        $_->[0] ,
        ord( $_->[1] ),
        Dumper( $_->[1] );
}
ETASRIONDCUF is reasonably close to ETAOINSHRDLU :)