[parted-devel] converting leading TABs to leading spaces: summary

Jim Meyering jim at meyering.net
Mon Mar 12 18:16:47 CET 2007


I wrote a script to do some analysis of what
it'd take to convert leading TABs to spaces
or leading spaces to TABs (depending on what the
majority of the file is already doing).
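
As a rough illustration of the tally that analysis rests on (a minimal
sketch, not the attached script itself; the tally_indentation name and the
sample lines are made up for this example), each line is classified by its
first byte only:

```perl
use strict;
use warnings;

# Hypothetical helper: count lines whose first byte is a TAB or a space.
# Only the first byte is examined, so a mixed prefix such as "\t  " still
# counts as TAB-indented.
sub tally_indentation
{
  my (@lines) = @_;
  my %n = (tab => 0, space => 0);
  foreach my $line (@lines)
    {
      if    ($line =~ /^\t/) { ++$n{tab} }
      elsif ($line =~ /^ /)  { ++$n{space} }
    }
  return %n;
}

# Made-up sample: two TAB-indented lines, one space-indented, one flush left.
my %n = tally_indentation("\tint x;\n", "  int y;\n", "\t  z = 0;\n", "}\n");
print "tab=$n{tab} space=$n{space}\n";
```

Lines with no leading white space are ignored, so they do not dilute the
per-file fractions.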

I ran the script with its relatively arbitrary defaults (see below),
and here are the results:

    find . -name '*.c' -type f -print0 \
        | ./indentation-normalize -files0-from=-

Here's the summary part of its output:

  # lines affected/ws-prefixed lines: 3783/31330
  # files affected: 76/80

  n:  1 #unchanged: already have TAB-based indentation
  N:  3 #unchanged: already perfect
  ^: 11 #files changed to have all-spaces indentation
  _: 65 #files changed to have TAB-based indentation

The point is that if we're going to be changing so many lines already (more
than 1 in 10), just to get leading white space per-file consistent, then
it's worth considering whether to go all the way and change formatting, too.
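
For reference, the "more than 1 in 10" figure follows directly from the
summary counts. A small sketch of the arithmetic (the variable names are
chosen for this example only):

```perl
use strict;
use warnings;

# Figures copied from the summary output above.
my ($lines_affected, $ws_lines) = (3783, 31330);
my ($files_affected, $n_files) = (76, 80);

my $line_pct = 100 * $lines_affected / $ws_lines;
my $file_pct = 100 * $files_affected / $n_files;
printf "lines: %.1f%%  files: %.0f%%\n", $line_pct, $file_pct;
```

That works out to roughly 12% of the white-space-prefixed lines and 95% of
the files.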

-----------------
Here's the script:

Note the defaults for a given file:
  - if at most 20 of the lines with leading white space are TAB-indented,
    or if at least 40% of them are space-indented, then convert the
    remaining TAB-indented lines to be indented with spaces.
  - otherwise, convert the <40% of space-indented lines to be indented
    with TABs
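
The per-file rule can be sketched on its own as follows. This is a
simplified illustration rather than the script's decide function: the
choose_style name and its return strings are hypothetical, and the values
20 and 0.40 are the script's defaults for --n-bad-lines and
--rel-good-lines-threshold.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the per-file decision, using the
# default thresholds.  Arguments are the counts of lines that start with
# a space and with a TAB, respectively.
sub choose_style
{
  my ($n_space, $n_tab) = @_;
  my $n = $n_space + $n_tab;

  return 'unchanged' if $n == 0 || $n_space == 0 || $n_tab == 0;
  return 'to-spaces' if $n_tab <= 20;            # few TAB lines: just fix them
  return 'to-spaces' if 0.40 <= $n_space / $n;   # at least 40% space-indented
  return 'to-tabs';                              # otherwise TABs dominate
}

# Made-up counts, one case per branch that changes a file.
foreach my $case ([50, 50], [30, 70], [5, 10])
  {
    my ($n_space, $n_tab) = @$case;
    printf "%d space / %d TAB lines: %s\n", $n_space, $n_tab,
      choose_style($n_space, $n_tab);
  }
```

Because every counted line starts with either a space or a TAB, the two
fractions sum to 1, so with the 40%/60% defaults no file with both kinds of
lines ends up in the "too mixed" category.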

-------------- next part --------------
#! /usr/bin/perl

my $VERSION = '2007-03-12 17:15'; # UTC

# For each file considered, determine whether it uses mostly TABs or
# mostly spaces for indentation.
# Compute some basic statistics given a what-if scenario, involving
# converting from all-spaces indentation to TAB-based indentation.

# FIXME (some other day): this is extremely naive:
# If a line starts with a single space, then it doesn't matter if
# the following byte is a TAB, it's counted as "properly space-indented".
# Likewise, if it starts with a single TAB, even if the following bytes
# are SP TAB SP TAB, it's still counted as "properly TAB-indented".

use strict;
use warnings;
use Getopt::Long;

(my $ME = $0) =~ s|.*/||;

# See documentation below.
my $N_BAD_LINES = 20;

# See documentation below.
my $REL_GOOD_LINES_THRESHOLD = .40;

# See documentation below.
my $REL_BAD_LINES_THRESHOLD = .60;

# The sum $REL_BAD_... + $REL_GOOD_... may be greater than 1.0.
# In that case (e.g., .90, .90), a file is changed only if no more
# than 10% of its white-space-prefixed lines differ from the remaining
# white-space-prefixed lines, or if it has at most $N_BAD_LINES bad lines.

my %CVT_tab =
  (
   NO_CHANGE__NO_CANDIDATE_LINES =>
     ['0', '#unchanged: no candidate lines'],
   NO_CHANGE__ALREADY_PERFECT =>
     ['N', '#unchanged: already perfect'],
   NO_CHANGE__ALREADY_ALL_BAD =>
     ['n', '#unchanged: already have TAB-based indentation'],
   NO_CHANGE__TOO_MIXED =>
     ['*', '#unchanged; too mixed'],
   N_BAD_TO_GOOD =>
     ['^', '#files changed to have all-spaces indentation'],
   N_GOOD_TO_BAD =>
     ['_', '#files changed to have TAB-based indentation'],
  );

sub marker($)
{
  my ($s) = @_;
  return $CVT_tab{$s}->[0];
}

sub msg($)
{
  my ($s) = @_;
  return $CVT_tab{$s}->[1];
}

my $total_n_lines_affected = 0;
my $total_n_files_affected = 0;
my $total_n_lines = 0;
my $total_n_files = 0;
my %stats;

sub usage ($)
{
  my ($exit_code) = @_;
  my $STREAM = ($exit_code == 0 ? *STDOUT : *STDERR);
  if ($exit_code != 0)
    {
      print $STREAM "Try `$ME --help' for more information.\n";
    }
  else
    {
      eval 'use Pod::PlainText';
      die $@ if $@;
      my $parser = Pod::PlainText->new (sentence => 1, width => 78);
      # Read POD from __END__ (below) and write to STDOUT.
      *STDIN = *DATA;
      $parser->parse_from_filehandle;
    }
  exit $exit_code;
}

# What to do:
# - convert bad indentation to good
# - convert good (e.g., leading spaces) to bad (e.g., TABs)
# - make no change
# Return a pair ($what_to_do, $n_lines_affected).
sub decide ($$$$$)
{
  my ($n, $n_good, $n_bad, $good_frac, $bad_frac) = @_;

  $$good_frac = 0;
  $$bad_frac = 0;
  $n == 0
    and return ('NO_CHANGE__NO_CANDIDATE_LINES', 0);

  $n_bad == 0
    and return ('NO_CHANGE__ALREADY_PERFECT', 0);

  $n_good == 0
    and return ('NO_CHANGE__ALREADY_ALL_BAD', 0);

  $n_bad <= $N_BAD_LINES
    and return ('N_BAD_TO_GOOD', $n_bad);

  $$good_frac = $n_good / $n;
  $REL_GOOD_LINES_THRESHOLD <= $$good_frac
    and return ('N_BAD_TO_GOOD', $n_bad);

  $$bad_frac = $n_bad / $n;
  $REL_BAD_LINES_THRESHOLD <= $$bad_frac
    and return ('N_GOOD_TO_BAD', $n_good);

  return ('NO_CHANGE__TOO_MIXED', 0);
}

sub process_file ($)
{
  my ($file) = @_;

  # Use a lexical file handle rather than the global bareword FH.
  open my $fh, '<', $file
    or die "$ME: can't open `$file' for reading: $!\n";
  my @lines = <$fh>;
  close $fh;

  my $n_white_space_lines = 0;
  my $n_tab_lines = 0;
  my $n_space_lines = 0;
  foreach my $line (@lines)
    {
      $line =~ /^[\t ]/
        and ++$n_white_space_lines;
      $line =~ /^\t/
        and ++$n_tab_lines;
      $line =~ /^ /
        and ++$n_space_lines;
    }
  $total_n_lines += $n_white_space_lines;

  my $n_space_frac;
  my $n_tab_frac;
  my ($do, $n_affected) = decide ($n_white_space_lines,
                                  $n_space_lines,
                                  $n_tab_lines,
                                  \$n_space_frac,
                                  \$n_tab_frac);
  $stats{$do} ||= 0;
  $stats{$do}++;

  my $marker = marker $do;
  $n_affected
    and print "$marker $file $n_affected\n";
  if ($n_affected == 0)
    {
      # Pass the marker as an argument so it is never parsed as a format.
      printf "%s %s (%d/%d) frac_sp %.2f\n", $marker, $file, $n_space_lines,
        $n_white_space_lines, $n_space_frac;
    }
  $total_n_lines_affected += $n_affected;
  $n_affected
    and ++$total_n_files_affected;
  ++$total_n_files;
  return;
}

{
  my $files0_from;
  GetOptions
    (
     'n-bad-lines=i' => \$N_BAD_LINES,
     'rel-good-lines-threshold=f' => \$REL_GOOD_LINES_THRESHOLD,
     'rel-bad-lines-threshold=f' => \$REL_BAD_LINES_THRESHOLD,
     'files0-from=s' => \$files0_from,
     help => sub { usage 0 },
     version => sub { print "$ME version $VERSION\n"; exit },
    ) or usage 1;

  my $fail = 0;
  foreach my $t ($REL_BAD_LINES_THRESHOLD, $REL_GOOD_LINES_THRESHOLD)
    {
      0 <= $t && $t <= 1.0
        or (warn "$ME: invalid argument: $t (out of range [0,1])\n"),
          $fail = 1;
    }

  $fail
    and exit 1;

  my @files;
  if (defined $files0_from)
    {
      1 <= @ARGV
        and warn "$ME: warning: ignoring command line argument(s)\n";

      my $file = $files0_from;
      if ($file eq '-')
        {
          *FH = *STDIN;
        }
      else
        {
          open FH, '<', $file
            or die "$ME: can't open `$file' for reading: $!\n";
        }
      $/ = "\0";
      while (defined (my $line = <FH>))
        {
          chomp $line;
          push @files, $line;
        }
      close FH;
      $/ = "\n";
    }
  else
    {
      @ARGV < 1
        and (warn "$ME: missing FILE argument\n"), usage 1;
      @files = @ARGV;
    }

  foreach my $file (@files)
    {
      process_file $file;
    }

  print "# lines affected/ws-prefixed lines:"
    . " $total_n_lines_affected/$total_n_lines\n";
  print "# files affected: $total_n_files_affected/$total_n_files\n";

  print "\n";
  foreach my $k (sort keys %stats)
    {
      my $msg = msg $k;
      print marker($k), ": $stats{$k} $msg\n";
    }

  exit 0;
}


__END__

###############################################################################
#
# Documentation
#

=head1  NAME

indentation-normalize - analyze space vs. TAB-based indentation

=head1  SYNOPSIS

indentation-normalize [B<OPTIONS>]... FILE...

=head1  DESCRIPTION

B<indentation-normalize> analyzes space- vs. TAB-based indentation,
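with an eye to converting from one style to another.  It only reports
what would change; it does not modify any file.  Lines that start with
a space are counted as "good", and lines that start with a TAB as "bad".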

=head1  OPTIONS

=over 4

=item B<--files0-from=FILE>

Read a list of NUL-separated file names from B<FILE>.
If B<FILE> is B<->, then read the list from standard input.

=item B<--n-bad-lines=N>

If there are no more than this many "bad" lines, then change those.

=item B<--rel-good-lines-threshold=FRACTION>

If the fraction of "good" lines (versus total white-space-prefixed lines)
is at least this large, then convert the remaining bad lines.

=item B<--rel-bad-lines-threshold=FRACTION>

If the fraction of "bad" lines (versus total white-space-prefixed lines)
is at least this large, then convert the remaining ("good") lines to
be consistent with the "bad" ones.

=item B<--help>

Emit usage hints.

=item B<--version>

Display program version.

=back

=head1  EXAMPLES

To operate on all C source and header files under the current directory,
use this command:

  find . -name '*.[ch]' -type f -print0 \
    | ./indentation-normalize --files0-from=-

=head1  AUTHOR

Jim Meyering <jim at meyering.net>

=cut

## Local Variables:
## indent-tabs-mode: nil
## perl-indent-level: 2
## perl-continued-statement-offset: 2
## perl-continued-brace-offset: 0
## perl-brace-offset: 0
## perl-brace-imaginary-offset: 0
## perl-label-offset: -2
## cperl-indent-level: 2
## cperl-brace-offset: 0
## cperl-continued-brace-offset: 0
## cperl-label-offset: -2
## cperl-extra-newline-before-brace: t
## cperl-merge-trailing-else: nil
## cperl-continued-statement-offset: 2
## eval: (add-hook 'write-file-hooks 'time-stamp)
## time-stamp-start: "my $VERSION = '"
## time-stamp-format: "%:y-%02m-%02d %02H:%02M"
## time-stamp-time-zone: "UTC"
## time-stamp-end: "'; # UTC"
## End:

