Considering String Allocation and Copying in Tla 1.x
  
  Is String Copying in libawk a Problem Worth Fixing?
  Hypothesis:
Excessive string copying is a significant (attention deserving) performance problem in
tla, and most of the problem is caused by the behavior oflibawk.
Background
    libawk in tla provides two data structures used throughout most
  of tla: a relational table (a 2-d, dynamically sized, array of
  strings) and an associative table (a mutable associative mapping
  of strings to strings).  Some basic relational operators are
  provided for the tables.
  
    These data structures are represented quite simply.  The relational
  tables are dynamically allocated arrays of pointers (each element a
  table row), pointing to dynamically allocated arrays of pointers
  (each element a column within one row), pointing to dynamically
  allocated 0-terminated strings of unsigned char values.  The
  associative tables are simple hash tables, with keys and values
  represented by privately allocated, 0-terminated strings of unsigned
  char values.
  
    A noteworthy property of the libawk implementation is that
  operations which create a new table (relational or associative) must
  newly allocate copies of all of the strings in that table,
  regardless of where the original data comes from.
  
    For example: a join operation accepts two tables as input and
  produces as output a third table.  All strings in the output table
  have identical contents to some string in one or both of the input
  tables, but join as currently implemented allocates a new copy of
  those strings anyway.
  
    Another example: suppose that I want two copies of a single table,
  each sorted using a different column as the sort key (as is common
  in some tla operations).  Each string in either table is
  duplicated in the other, the duplicate being a separately allocated
  copy.
  
Envelope Calculations and Anecdotal Profiling/Tuning Results
To put some rough numbers to this, I examined an arch import of the linux kernel prepared by John Goerzen. I computed an inventory:
      % tla inventory --source --both --ids > ,foo
  
  which produces a list of all source files which are not arch control files, listing the relative path to each file along with that files inventory id, one such pair per line. This listing disregards all arch control files on the grounds that, in theory, arch might be modified to shrink the list of such files by a large amount.
The size of the resulting listing is a reasonable approximation of the amount of memory that must be dynamically allocated to create an in-core table containing the inventory. It is also approximately the amount of memory needed to allocate the keys and values of an associative table mapping file paths to ids or vice versa.
    The number of lines in that listing is basically half the number of
  calls to malloc needed to create such a table and half the number
  of calls to free needed to free such a table.
  
    Some performance critical operations in arch are likely to create
  (approximately but also "at least") two (distinctively sorted)
  tables of the (file-path, inventory-id) relationship and two
  associative tables mapping file paths to ids and ids to inventory
  ids.
  
The listing size for this particular copy of the kernel has a bit over 30,00 lines and is just a bit over 1.5MB long.
Four tables (two associative, two relational) will require
                n_allocations == 4 * (30000 * 2)
                              == 240000
                n_string_bytes == 4 * 1.5MB
                               == 6MB
  
  
    All four tables can be constructed from a single one, returned by
  the inventory primitivie.  That construction, in addition to
  allocating new storage for strings, will copy the contents of each
  string three times: 4.5MB of string copying spread across 180000
  separate strings.
  
These expenses are likely to be small relative to the costs of system calls made by arch and, especially on small trees, are likely to be insignificant. Alas, there is a limit to how much arch's use of system calls can be optimized and my sense is that we are nearing that limit.
    In absolute terms (such as "wall clock time") these and related
  string and table expenses are quite noticable.  For example, several
  arch developers have obtained noteworthy performance improvements
  simply by replacing the unoptimized string copying operations with
  calls to the highly optimized versions typically found in the native
  libc of each host system.
  
    In my own experience, substitution of a silghtly slower malloc
  for a native malloc also has a tangible performance impact and
  quick-n-dirty profiling with gprof suggests that the vast majority
  of these calls originate in libawk.
  
    These calculations and experience with profiling and optimizing 
  suggest that attention to libawk, especially with the aim of
  eliminating string allocations and copies, is likely to be rewarded
  with a somewhat faster, significantly more memory-efficient 
  tla.
  
Secondary Reasons to Want to Change String Handling
    The performance incentives for changing string handling are 
  not the only reason to reconsider string handling in libawk.
  
I have three other reasons (at least):
The libawk Abstractions Do Not Require String Copying
    
      libawk was originally intended to provide mutable tables
  of immutable strings.   In other words, clients of the library
  should be free to add or delete rows or columns, keys and values --
  but they should not ever directly modify a string which is an
  element of a table.
    
      As currently presented, the libawk API does not enforce that
  aspect of its abstraction.  To the degree it is successfully
  honored by the rest of the code in tla (which is not entirely
  clear), the abstraction is honored only by careful coding with
  no particular assistence from the C type system.
    
      Modifying the libawk API to enforce the immutability of 
  strings would make client code of that library easier to 
  understand, maintain, verify, and modify.
  
  
  
    
The Representation Type of libawk Strings Will Probably Change Anyway
    
      It is conceivable that, as tla acquires support for
  extended character sets in filenames and other strings,
  these will be simply be represented internally as 
  Unicode strings, using the UTF-8 encoding form.
    
It is also conceivable (and in my view more sensible) if strings might internally be represented as Unicode strings using a different encoding form (such as UTF-16 or UTF-32, depending on circumstance).
      Regardless, either way, the shift from presumed-ASCII strings
  to strings which sometimes include extended characters is 
  likely to be a difficult one to implement.   (Indeed, one
  justification for beginning tla 2.0 is to avoid having to make
  this shift on the tla 1.x code base by instead providing a
  cleaner-from-the-start foundation in tla 2.0).
    
      If we want to potentiate making the shift to extended characters in
  tla 1.x, modifying the libawk API so that internal string
  representations are better hidden is a definate help.
    
      Currently, much code in tla 1.x assumes that, for example, 
  the string in the third column of the second row of a table
  can be accessed as a value of type t_uchar * using a 
  construct such as:
    
    
        t_uchar * table_elt = table[1][2];
    
    Eliminating such constructs by hiding them behind a procedural abstraction (i.e., not having clients use array subscripting but, instead, having them call a function to access elements) would be progress on both eliminating needless string copying and preparing the way for extended character sets.
Hash-consed and/or Constructive Strings May be Desirable
      [This is, by far, the most speculative rationale for 
  changing libawk.]
    
      String comparison is an important operation in tla, both 
  comparison for equality and comparison for lexical order.
    
Unicode, necessarily, makes comparison operations considerably more expensive than their ASCII-only analogs.
One sometimes applicable technique for reducing the expense of such operations in a string-intensive application is to (a) use only immutable strings; (b) never allocate two strings with identical contents (always share a single string) -- this gives rise to a fast equality test; (c) memoize the results of all string comparisons for order, doing so in a manner that allows a quick inference of the order between two previously uncompared strings provided that they have both been compared to some other string or to two other strings whose order is known (or less expensively inferrable).
Another sometimes applicable technique for reducing string copying and, in particular, speeding up operations such as substring extraction and string concatenation is to use what might be called "constructive strings" (sometimes called "ropes" after an implementation by Hans Boehm).
      Using reference-counted immutable strings in libawk opens the 
  door to trying out either of these techniques.
  
  
  
Program for Research
    I am interested in finding an easy way to eliminate libawk-related
  string copying.
  
    There are many ways a libawk-style API might work without so 
  much string copying but I think that the most straightforward
  of these would be to:
  
1. Reference-count
libawkStrings Strings do not contain references to other objects and thus are not subject to being part of data structure containing a cycle of references. For that reason, reference counting is a natural fit for memory managing them.2. Make the
libawkTypes Opaque The internal representations inlibawkshould be hidden from clients.
    My intention, then, isf to proceed first by making such changes to
  libawk.   This will immediately break a good deal of code 
  throughout tla.   The breakage will be of a variety that a 
  C compiler will report it -- this is convenient because the 
  compiler will help to trace down all code which must be modified
  to honor the corrected libawk abstractions.
  
Current Questions
Have other Arch developers tried a similar experiment already?
Do other Arch developers agree with the conclusions I draw from my envelope calculations and the anecdotal profiling and tuning results?
    What is the comfort-level among Arch developers regarding the
  prospects of migrating such a change, which is likely to be
  very large in scope, from a tla-1.3.1 branch to other branches
  such as baz1.x and tla-1.4?
  
Copyright
Copyright (C) 2004 Tom Lord
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
    See the file COPYING for further information about
 the copyright and warranty status of this work.