Phonecode in 8 Lines!

back to Axiomatic Language Home Page

Phonecode in 8 Lines!
1. Introduction

This webpage gives an axiomatic language specification for the "phonecode" problem of Lutz Prechelt, [Prechelt 2000, 2000TR, 2003]. This problem is to encode phone numbers as sequences of words from a dictionary, given a mapping of letters to digits. This webpage will be gradually refined in preparation for the July 8, 2024 short paper deadline for ICLP 2024. A definition of axiomatic language can be found here.

The software engineering assumption for this project is that programming time is roughly the same per line of (non-comment) code, regardless of language level. So, if a C language program for a problem is typically 1/3 the size of an assembly language program for the same problem (ignoring comment lines), we would expect programmer productivity in C to be roughly 3 times greater.

Another assumption for this project is that the automatic transformation of axiomatic language specifications to efficient programs for problems like this can eventually be achieved. Thus, this currently-unexecutable specification can be considered an actual solution to the phonecode problem.

Section 2 talks about claims and evidence for the idea that a higher-level programming language that enables shorter source programs leads to shorter programming time and thus greater programmer productivity. Section 3 defines the phonecode problem and section 4 gives an axiomatic language specification for that problem. Section 5 gives an expanded, tutorial version of the phonecode specification with additional comments inserted to help explain the details. Finally, section 6 gives some concluding comments and dicusses future work.

2. Roughly Constant LOC/hr, Regardless of Language?!

The idea that programming time per line of non-comment code (loc) is roughly constant regardless of language level was an early observation in software engineering. Fred Brooks writes "Productivity seems constant in terms of elementary statements, a conclusion that is reasonable in terms of the thought a statement requires and the errors it may include." [Brooks 1975, p.94] Prechelt [2000, p.25] refers to "Common software engineering wisdom, which says that 'the number of lines written per hour is independent of the language'".

Prechelt's phonecode results support this claim, where lines-of-code are "anything that contributes to the semantics of the program in each of the program source files, e.g. a statement, a declaration, or at least a delimiter such as a closing brace (end-of-block marker)." [Prechelt 2003, p.31] Based on median program lengths and programming times, "non-scripts [C, C++, Java] are typically two to three times as long as scripts [Perl, Python, Rexx, Tcl]" [ibid., p.31] and "scripts (total median 3.1 hours) take only about one third of the time required for the non-scripts (total median 10.0 hours)" [ibid., p.37], with some caveats, such as the script times being self-reported.

The median loc/hr values for the seven languages are shown in Fig. 24 of [ibid., p.41] and are all between 20-40. We therefore accept the assumption of roughly constant loc/hr regardless of language. Paul Graham [2002] argues this and makes the logical inference that "the main point of high-level languages is to make source code smaller". For very-high-level, declarative, functional, and logic languages, there may be unrealized potential to make our programs smaller and thus achieve greater programming productivity (without sacrificing readability). Exploring that possibility is the motivation for this project.

3. The Phonecode Problem

The phonecode program reads a text file of zero-or-more phone numbers, one per line, and a text dictionary file of zero-or-more words, one per line, and "encodes" the phone numbers as word strings, interspersed as needed with phone number digits. A phone number is a string of zero or more digits, dashes(-), and slashes (/). A dictionary word is a string of (upper- and lowercase) letters, dashes (-), and double quotes ("), starting with a letter.

The following sequence of letter groups defines a mapping of letters to the digits 0-9:

     (e jnq rwx dsy ft am civ bku lop ghz)
  ->  0  1   2   3  4  5   6   7   8   9

(E.g., letter 'd' maps to digit '3'; letter 'H' maps to '9'.) From this mapping of letters to digits, we get a mapping of dictionary words to digit strings. ?? Special characters in phone numbers and dictionary words are ignored in the encoding.

The encoding of a phone number to words is done left-to-right in incremental steps. At any point a prefix of a phone number will have been encoded to a word string with possible included digits. An encoding step attempts to match the starting digits of the not-yet-encoded phone number suffix with digit strings for dictionary words. For each match the dictionary word is appended to the encoding and the matching digits are removed from the phone number suffix. If there are no matches, a single phone number digit is moved to the encoding, but only if the previous encoding step did not move a digit. (An encoding thus cannot have consecutive digits.) When a sequence of encoding steps has "consumed" the phone number, we have found an encoding for that number. A phone number may have zero or more encodings.

The phone number - encoding pairs found are listed in an output file, one pair per line, in the following format:

  _phonenum: {[_dig] _word} [_dig]

Special characters are included in _phonenum and _word.

See the Prechelt papers [2000, 2000TR, 2003] for his natural language decription of the phonecode requirements along with examples (and let me know if I have misinterpreted them in my paraphrase!).

4. Axiomatic Language Phonecode Solution

The axiomatic language specification below for the phonecode problem has just 8 non-comment lines of code. Since axiomatic language is an extremely minimal language, it must be built-up with assorted utility functions, such as numbers, higher-order forms, Boolean functions and relations, etc. These utilities would be provided in a standard library, which would be shared by all applications, and thus this reusable code would not be counted as part of the source code for any particular application. Axiomatic language encourages this type of generalization and abstraction of "general purpose functionality" from a specific application, thus making the application shorter.

Another characteristic of this phonecode solution is a preference for short names and a "dense" style where composed functions are often written on the same line for conciseness (and to keep the line count down).

Some abbreviations used in the source code and its comments are as follows:

  # - number of something (#apples)
  > - denotes a function/map from an expression domain to a range
    domain>range - (usage)
  >=0 - zero-or-more
    >=n - greater-than-or-equal-to n
  ~ - prefix for boolean function symbol that complements the function 
  arg - argument for a function or predicate
    iarg - input argument
    oarg - output argument (result of a function)
  asn - assign expr to a position in a seq (replacing old expr at that pos)
  ax(s) - axiom(s), set of axioms
  bin, b - binary (function or relation) - 2 input arguments
  bool, b - Boolean function/predicate (that uses/yields true/false values)
  cat - [con]catenate char strings
    cat* - catenate a sequence of char strings into a single string
  char, ch - character
  conclu, conc - conclusion (of an axiom)
  cond - condition (of an axiom)
  consec - consecutive (elements in a sequence)
  cur - "currying" - a prefix of input args is moved into a function expr
  def - define (a name for a function or predicate expression)
  dic - dictionary (of phonecode code words)
  dig, d - digit
  dsym - decimal digit symbol: 3, 25
  dup - duplicate (an input arg)
  elem, el - "element" (expression) in a sequence
      (sometimes string vars in a seq may be called elems of that seq)
  elim - eliminate
  expr, ex, e, x - expression
  f, `f - false (Boolean value)
  filt - filter out (remove) seq elems which fail a Boolean predicate
  fmt - format
  func, fn, f - function
  id - identifier; identity function
  ins - insert (an expression into a sequence)
  ins_btw - insert an expression between elements of a sequence
  ix, i - index
  lc, lcltr - lowercase (letter)
  len, l - length of a sequence
  ltr, l - letter
  num, nat, natnum - natural number
  outf - output file
  ph#, phonenum - phone number
  phcode, phc - phonecode (problem)
  pos - position of an expr in a seq (indexed from zero)
  pred - predicate (name)
  rel, r - relation
  repl - replace (seq expr with result of a unary fn applied to that expr)
  rev - reverse (a sequence)
  rmv - remove (delete) an expr from a sequence
  seq - sequence
  str - string of some "type" (often chars) within a seq
  sym - symbol
  t, `t - true (Boolean value)
  uc, ucltr - uppercase (letter)
  un - unary (function)
    bun - binary func w its 2 iargs in one 2-elem seq (-> unary fn)
  w - with

A "type" or expression syntactic category may be denoted by a name beginning with an underscore: _num, _ltr. A string of >=0 of the same type within a sequence may be denoted by an underscore before and after the type name: _dig_.

A program that reads multiple input text files and writes a single output text file can be specified by a set of "Program" valid expressions of the following form:

  (Program _infiles_ _outfile)  - >=0 input files, 1 output file

A text file is a sequence of lines, each a character sequence.

A functional style is used for many predicates:

  (_fn _iargs_ _oarg)   - func on >=0 input args yields a single output arg
     - input arguments are numbered from 0: _iarg0 _iarg1 ....

Note that in addition to single-line comments,

   ...code...  ! comment...

we use multi-line comment blocks framed by '!/' and '!\' markers that start in column 1:

  !/ comment...
  comment...
  comment...
  !\ comment...

Comment blocks can be nested.

Recall that syntax-extension symbols can start with any printable character other than blank ` $ % ! ( ) ' " and symbol/atom/variable characters after the first can be any printable characters other than blank ( ). We often use special characters for, say, higher-order forms.

Our axiomatic language phonecode solution is given below. Indices in column 1 count the non-comment source lines.

 
!---+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8
! --- Phonecode top-level program axiom ---

1 (Program %dic %ph#s %outf)<      ! Program phonecode
!/ - encode phone numbers with dictionary words using letter>digit map
  input arg text files:
    %dic - >=0 words of dictionary, one-per-line
      _word is uc/lc letter + >=0 letters and special chars - "
    %ph#s - phone numbers to encode, one-per-line
      phone number is >=0 digits and special chars - /
  output arg text file:  %outf - phonenum/encoding pairs, one per line:
    _ph#: _wordd .. _wordd  - encoding "wordd"s separated by blanks
    - _wordd is dic word or single ph# dig if no possible word at that point
      - consecutive digit-wordd's are not allowed
    - Phone number encoding is incrementally computed left-to-right.
!\  - A ph# might not have an encoding and thus wouldn't be in the output.    

! get map of ph# digits to dictionary words
!/ Given a sequence of lc-letter-group symbols, we assign digit index '0'..'9'
   to each letter based on its group's sequence index.  This letter-to-digit
   mapping, applied to %dic, gives us a word-to-digit-string mapping,
!\ from which we get a digit-string-to-dictionary-word map.
2  ((ltr>dig digs>word) (e jnq rwx dsy ft am civ bku lop ghz) %dic %digs>word),
                       ! 0  1   2   3  4  5   6   7   8   9
   ! Note that we hard-code the letter-to-digit map within the Program axiom.
   ! ltr->dig map  +  dictionary  -->  digits->word map

! compute all stream solutions of digs>word map applied to phone number digits
3  (((* (filt dig?)) (* (eat (bite1 %digs>word)))
4    (* (filt (~consec? dig?)))) %ph#s %codes), 
!/ We remove special chars from phone numbers ph#s to get digit-only ph#ds.
   We then incrementally "eat" prefixes from each ph#d while outputting
   dictionary words whose digit string matches each prefix.  If no word can
   be matched at some point, a single digit of ph#d is copied to the output.
   The resulting "wordd" sequence (wordd = dic word or phonenum digit)
   is a candidate encoding except that we then eliminate encodings where
!\ consecutive wordds are both digits.

! format phone numbers and their codes for output file
5  (((* (.- join)) cat* *fmt) %ph#s %codes %outf).
   ! - %outf - output is text file of ph#-code pairs, 1 per line
!/ Given original phone numbers and their encodings, we get
!\ phonenum/code pairs and then format each of these for output.

! --- Phonecode app supporting functions ---:

! ltr>dig - define letter-to-digit mapping from seq of lc letter-group symbols
6 (def ltr>dig (*sel1 el>ix1 dup lc>uc cat)).
  !  -- Could this be a std lib fn?!

! digs>word - get map of digit strings to dictionary words
7 (def digs>word ((skip 1 (dup (* (filt ltr?)))) (.-- map) *join)).
!/ Given the ltr>dig map and a dictionary of words, this function applies
!\ the map to the dictionary words to get a word-digits_to_word map.

! fmt - format phonenum-code pair for output line
8 (def fmt ((cur ins 1 ": ") (repl 2 (cur ins_btw ' ')) chxstr)).
!/ Given a ph#-code pair, this function first inserts ": " between the
   phonenum and its code, then inserts a blank between each "wordd" of
   the code, and then flattens the nested character expr into a single
!\ character string, which will be a line of the output file.

Links to standard library source files and some of the predicates they define are as follows:

form.ax - utilities involving the form of expressions
(join _a _b (_a _b)) - join two argument exprs into 2-elem seq
(cat (_str1_) (_str2_) (_str1_ _str2_)) - catenate two seqs into one
(cat* ((_strA_) .. (_strZ_)) (_strA_ .. _strZ_)) - cat seq of seqs into one
(rev _seq _rev) - reverse a sequence
(ins_btw _e _seq _seq') - insert expr _ between seq elements

ho.ax - higher-order predicates and functions
(_f_) - function composition on input arg stack, applied left-to-right
dup - duplicate the first input argument in a function composition
swap - swap the first two input arguments for a function composition
(cur _f _iargs0_) - "currying" - input arg prefix is part of function
(* _pred) - apply predicate to sequences of arguments
(.- _binfn/rel) - apply binary fn/rel to arg0 and all elems of arg1 seq
(.-- _binfn/rel) - apply binary fn/rel to arg0 and all elems of arg1 seq seqs
(bun _binfn/rel) - apply binary fn/rel to 2-elem seq (makes it unary)
(map _map _ex _ex') - apply _map, consisting of (_(_ex _ex')_) pairs, to an expr

natnum.ax - natural number predicates and functions
-- no utilities directly used

bit.ax - predicates and functions that use bit representation
-- no utilities directly used

char.ax - predicates and functions involving characters
*pred - apply predicate to sequences of arguments
lc>uc - map lowercase letters to uppercase in nested ch expr
sel_dsym - select elem [_dsym] from a sequence [0..]
(el>ix1 _elgroups _el>dig) - given seq of elem group seqs, get elem-to-1-dig-index map
(skip _dsym _f) - skip _dsym iargs, then apply fn
(repl _dsym _uf) - replace elem [_dsym] of iarg0 seq w unary fn _uf applied to it
(ins _dsym _e _seq _seq') - insert expr before elem [_dsym] of seq
(chxstr _chexpr _chstr) - flatten nested char expr into a char string

bool.ax - Boolean predicates and functions (- ending ? is a convention)
_type? - returns true/false if expr is of given type
_dig?, _ltr? - returns t/f if char expr is digit, letter
~_bfn? - returns complement of Boolean function _bfn?
(consec? _type?) - t/f if seq has consec elems of given type

stream.ax - incremental stream processing functions
(bite1 _map) - if possible, bite a stream prefix that matches map, outputting expr;
- else, copy 1 stream elem to output
(eat _bite) - "eat" a stream using all possible bite sequences, outputting expr seqs

5. Expanded, Tutorial Axiomatic Language Solution Presentation

To help the user understand the above specification, we have re-written it over multiple lines with extra comments inserted. Some comments document the actions of individual functions; others show the input argument stack before and after the function invocation. (Our assumption is that a future axiomatic language programmer would have comprehensive familiarity with the standard utility library, so an expanded presentation like this would not be necessary.)

!---+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8
! *** expanded program axioms with inserted comments ***
! --- Phonecode top-level program axiom ---


1 (Program %dic %ph#s %outfile)<     ! Program phonecode

!/ copied from above ...
  - encode phone numbers with dictionary words using letter>digit map
  input arg text files:
    %dic - >=0 words of dictionary, one-per-line
      _word is uc/lc letter + >=0 letters and special chars - "
    %ph#s - phone numbers to encode, one-per-line
      phone number is >=0 digits and special chars - /
  output arg text file:  %outf - phonenum/encoding pairs, one per line:
    _ph#: _wordd .. _wordd  - encoding "wordd"s separated by blanks
    - _wordd is dic word or single ph# dig if no possible word at that point
      - consecutive digit-wordd's are not allowed
      - Phone number encoding is incrementally computed left-to-right.
!\  - A ph# might not have an encoding and thus wouldn't be in the output.    


!/ copied ...
! get map of ph# digits to dictionary words
!/ Given a sequence of lc-letter-group symbols, we assign digit index '0'..'9'
   to each letter based on its group's sequence index.  This letter-to-digit
   mapping, applied to %dic, gives us a word-to-digit-string mapping,
!\ from which we get a digit-string-to-dictionary-word map.
2  ((ltr>dig digs>word) (e jnq rwx dsy ft am civ bku lop ghz) %dic %digs>word),
                       ! 0  1   2   3  4  5   6   7   8   9
   ! Note that we hard-code the letter-to-digit map within the Program axiom.
   ! ltr->dig map  +  dictionary  -->  digits->word map

! expanded ...
! get map of ph# digits to dictionary words
 ((     ! (e jnq rwx ... lop ghz) _dic   - input args for this ax cond fn
   ltr>dig     ! get map of upper/lower-case letters to digits
        ! (('E' '0') .. ('Z' '9') ('e' '0') .. ('z' '9')) _dic
   digs>word   ! apply ltr>dig map to dictionary
        ! _digs>word - map of ph# digit prefixes to dictionary words
        ! (.. (_digits _dic-word) ..)
  ) (e jnq rwx dsy ft am civ bku lop ghz) %dic %digs>word),


!/ copied ...
! compute all stream solutions of digs>word map applied to phone number digits
3  (((* (filt dig?)) (* (eat (bite1 %digs>word)))
4    (* (filt (~consec? dig?)))) %ph#s %codes), 
!/ We remove special chars from phone numbers ph#s to get digit-only ph#ds.
   We then incrementally "eat" prefixes from each ph#d while outputting
   dictionary words whose digit string matches each prefix.  If no word can
   be matched at some point, a single digit of ph#d is copied to the output.
   The resulting "wordd" sequence (wordd = dic word or phonenum digit)
   is a candidate encoding except that we then eliminate encodings where
!\ consecutive wordds are both digits.
!\

! expanded ...
! compute all stream solutions of digs>word map applied to phone number digits
 ((   ! _ph#s
   (* (filt dig?))  ! filter out non-digits from ph#s - get digit-only ph#ds:
      ! _ph#ds
   (* (eat (bite1 %digs>word)))  !incrementally "eat" each ph#d to get "codeX"s
      ! - each "bite" matches digit prefix to digs>word map to get output word
      !   If no match, the bite1 fn copies 1 ph#d digit to the output.
      ! - Thus, an output "codeX" encoding is a sequence of wordds, each a
      !   dictionary word or a single digit.  All possible codeXs are
      !   generated by this process, where a codeX encoding may have an
      !   arbitrary number of ph#d digits:      
      ! (.. _codeXs-for-ph#i ..)  - >=1 codes for each ph#
      !   - codeX "wordd" is digit where dictionary word not possible
      !   - Note that each ph# will have at least one codeX since ph#d itself
            is a possible codeX (for example, if the dictionary is empty).
      ! - But consecutive digits are not allowed for ph# encodings!!
   (* (filt (~consec? dig?)))  ! elim codes w consecutive digits
      ! (.._(valid-)codes-for-ph#i..)  - no consec digits in codes
      ! - Note that it's possible a ph# may not have any valid encodings.
  )   ! end of fn composition returns valid codes for each ph#
  %ph#s %codes),


!/ copied ...
! format phone numbers and their codes for output file
5  (((* (.- join)) cat* *fmt) %ph#s %codes %outf).
   ! - %outf - output is text file of ph#-code pairs, 1 per line
!/ Given original phone numbers and their encodings, we get
!\ phonenum/code pairs and then format each of these for output.
!\

! expanded ...
! format phone numbers and their codes for output file
 ((   ! _ph#s _codes_for_ph#s
   (* (.- join))  ! join each ph# w each of its codes
      ! (.. (.. (_ph#i _code-for-ph#i) ..) ..)
      !  - note that ph#-code pairs are grouped by ph#
   cat*   ! make a flat list of ph#-code pairs
      ! (.. (_ph# _code) ..)
          ! - if a phone num has no codes, it will not appear in any pair
   *fmt   ! format each ph#-code pair into output line
      ! (.. "ph#: wordd .. wordd" ..)   - this is the output text file!
      !  - returned in %outf
  ) %ph#s %codes %outf).

! --- Phonecode app supporting axioms ---:


!/ copied ...
! ltr>dig - define letter-to-digit mapping from seq of lc letter-group symbols
6 (def ltr>dig (*sel1 el>ix1 dup lc>uc cat)).
  !  -- Could this be a std lib fn?!
!\

! expanded ...
! ltr>dig - define letter-to-digit mapping from seq of lc letter-group symbols
 (def ltr>dig     ! define ltr>dig function
  (    ! (e jnq rwx ... lop ghz)   - lc letter groups - alphabet partition
   *sel1   ! sel elem [1] of each letter group symbol (- gets sym char string)
       ! ("e" "jnq" ... "lop" "ghz")
   el>ix1  ! get map of lc letter group elements to single digits
           ! (we index seq elems from ['0'])
       ! (('e' '0') .. ('g' '9') ('h' '9') ('z' '9'))  - lcltr-to-dig map
   dup   ! duplicate the first arg
       ! (('e' '0') .. ('z' '9')) (('e' '0') .. ('z' '9'))
   lc>uc  ! map lc letters to uppercase
       ! (('E' '0') .. ('Z' '9')) (('e' '0') .. ('z' '9'))
   cat   ! concatenate lc- and uc-maps into single letter>digit map
       ! (('E' '0') .. ('Z' '9') ('e' '0') .. ('z' '9'))
  )).    ! end of fn composition and fn definition
         ! returns ltr>dig map
           

!/ copied ...
! digs>word - get map of digit strings to dictionary words
7 (def digs>word ((skip 1 (dup (* (filt ltr?)))) (.-- map) *join)).
!/ Given the ltr>dig map and a dictionary of words, this function applies
!\ the map to the dictionary words to get a word-digits_to_word map.
!\

! expanded ...
! digs>word - get map of digit strings to dictionary words
 (def digs>word
  (    ! _ltr>dig_map _dic  - the two input args 
   (skip 1   ! skip past 1 arg, then apply function
     ((dup   ! duplicate the dictionary arg
           !     "     _dic _dic
      (* (filt ltr?))  ! remove non-letters from dictionary words
           !     "     _dicl _dic
           !   - _dicl - letter-only words
   ) )
       ! _ltr>dig_map _dicl _dic
   (.-- map)  ! apply ltr>dig map to letter-only dictionary words
              ! - result is a unique string of digits for each word
         ! _(word-)digits _dic
   *join  ! pair each digit string with original dictionary word
       ! (.. (_digits _word) ..)  - map of digit strs to original words
 )).      ! end of fn composition and fn definition
          ! digs>word map is returned!


!/ copied ...
! fmt - format phonenum-code pair for output line
8 (def fmt ((cur ins 1 ": ") (repl 2 (cur ins_btw ' ')) chxstr)).
!/ Given a ph#-code pair, this function first inserts ": " between the
   phonenum and its code, then inserts a blank between each "wordd" of
   the code, and then "flattens" the nested character expr into a single
!\ character string, which will be a line of the output file.
!\

! expanded ...
! fmt - format phonenum/code pair for output line
 (def fmt
  (   ! (_ph# _code)  - a ph#-code pair
   (cur ins 1 ": ")  ! insert punctuation between ph# and code 
      ! (_ph# ": " _code)
   (repl 2 (cur ins_btw ' '))  ! insert blanks between code words/digits
      ! (_ph# ": " (_wordd ' ' .. _wordd))
      ! - We replace arg 2 w result of unary fn that inserts blank between
      !   its sequence elements.
   chxstr   ! flatten nested char expr into single string
      ! "_ph#: _wordd .. _wordd"  - syntax elems in char string are
      !                             blank-separated and concatenated
      !  - this is an output text file line
 )).

6. Final Comments

See section "5 Conclusions and Future Work" of paper.

References

[Brooks 1975] Frederick Brooks, The Mythical Man-Month: Essays on Software Engineering, Addison-Wesley.

[Graham 2002] Paul Graham, "Succinctness is Power".

[Prechelt 2000] Lutz Prechelt, "An empirical comparison of seven programming languages", IEEE Computer, vol. 33, no. 10, pp. 23-29, Oct. 2000, doi: 10.1109/2.876288, link.

[Prechelt 2000TR] Lutz Prechelt, "An empirical comparison of C, C++, Java, Perl, Python, Rexx, and Tcl for a search/string-processing program", Dept. of Informatics, Univ. of Karlsruhe, Technical Report 2000-5, Mar. 2000.

[Prechelt 2003] Lutz Prechelt, "Are Scripting Languages Any Good? A Validation of Perl, Python, Rexx, and Tcl against C, C++, and Java", in Advances in Computers 57:205-270, Dec. 2003, doi: 10.1016/S0065-2458(03)57005-X, link.

back to Axiomatic Language Home Page