sys/contrib/zlib/doc/txtvsbin.txt

A Fast Method for Identifying Plain Text Files
==============================================


Introduction
------------

Given a file coming from an unknown source, it is sometimes desirable
to find out whether the format of that file is plain text.  Although
this may appear to be a simple task, a fully accurate detection of the
file type requires heavy-duty semantic analysis of the file contents.
It is, however, possible to obtain satisfactory results by employing
various heuristics.

Previous versions of PKZip and other zip-compatible compression tools
used a crude detection scheme: if more than 80% (4/5) of the bytes
found in a certain buffer are within the range [7..127], the file is
labeled as plain text, otherwise it is labeled as binary.  A prominent
limitation of this scheme is the restriction to Latin-based alphabets.
Other alphabets, like Greek, Cyrillic or Asian, make extensive use of
the bytes within the range [128..255], and texts using these alphabets
are most often misidentified by this scheme; in other words, the rate
of false negatives is sometimes too high, which means that the recall
is low.  Another weakness of this scheme is a reduced precision, due to
the false positives that may occur when binary files containing large
amounts of textual characters are misidentified as plain text.
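
The legacy rule described above can be illustrated with a minimal
Python sketch.  The function name `looks_like_text_legacy`, the sample
buffers and the treatment of an empty buffer as binary are assumptions
made for illustration; they are not taken from PKZip or any other tool.

```python
def looks_like_text_legacy(buf: bytes) -> bool:
    """Counting heuristic: text if more than 4/5 of the bytes are in 7..127."""
    if not buf:
        return False  # assumption: an empty buffer is reported as binary
    in_range = sum(1 for b in buf if 7 <= b <= 127)
    # Integer form of "in_range / len(buf) > 0.8", avoiding float rounding.
    return 5 * in_range > 4 * len(buf)


# Hypothetical sample buffers.
latin_text = "The quick brown fox".encode("ascii")
# Greek text in a single-byte encoding: almost every byte is >= 128.
greek_text = "\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1 \u03ba\u03cc\u03c3\u03bc\u03b5".encode("iso8859_7")
# A binary-looking buffer that happens to carry a long run of letters.
binary_with_text = b"\x00\x01" + b"A" * 20

for name, buf in [("latin", latin_text),
                  ("greek", greek_text),
                  ("binary", binary_with_text)]:
    print(name, looks_like_text_legacy(buf))
```

In this sketch the Greek sample is rejected (a false negative) and the
binary sample is accepted (a false positive), which are the two
weaknesses named above.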

In this article we propose a new, simple detection scheme that features
a much higher precision and a near-100% recall.  This scheme is
designed to work on ASCII, Unicode and other ASCII-derived alphabets,
and it handles single-byte encodings (ISO-8859, MacRoman, KOI8, etc.)
and variable-sized encodings (ISO-2022, UTF-8, etc.).  Wider encodings
(UCS-2/UTF-16 and UCS-4/UTF-32) are not handled, however.
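
One plausible reason for this limitation, offered here as an assumption
rather than as the author's stated rationale, is that the wider
encodings pad ASCII-range characters with zero bytes, and 0 (NUL) is one
of the bytecodes that the algorithm below treats as non-textual.  The
following Python sketch simply counts NUL bytes in a hypothetical sample
string under three encodings.

```python
# Hypothetical sample; any ASCII-only string shows the same effect.
sample = "plain text\n"

encoded = {name: sample.encode(name)
           for name in ("utf-8", "utf-16-le", "utf-32-le")}

# Count NUL bytes in each encoded form.
nul_counts = {name: data.count(b"\x00") for name, data in encoded.items()}

for name, count in nul_counts.items():
    print(f"{name}: {len(encoded[name])} bytes, {count} NUL bytes")
```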


The Algorithm
-------------

The algorithm works by dividing the set of bytecodes [0..255] into three
categories:
- The allow list of textual bytecodes:
  9 (TAB), 10 (LF), 13 (CR), 32 (SPACE) to 255.
- The gray list of tolerated bytecodes:
  7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB), 27 (ESC).
- The block list of undesired, non-textual bytecodes:
  0 (NUL) to 6, 14 to 25, 28 to 31.

If a file contains at least one byte that belongs to the allow list and
no byte that belongs to the block list, then the file is categorized as
plain text; otherwise, it is categorized as binary.  Gray-listed bytes
are tolerated but do not count as textual, so a file made up solely of
them is binary.  (The boundary case, when the file is empty, likewise
falls into the latter category.)
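
The rule above can be sketched in a few lines of Python.  The names
`ALLOW`, `GRAY`, `BLOCK` and `is_plain_text` are illustrative
assumptions; the sketch treats the block list as every bytecode that is
on neither the allow list nor the gray list.

```python
# The three categories, written as sets of byte values.
ALLOW = frozenset({9, 10, 13}) | frozenset(range(32, 256))
GRAY = frozenset({7, 8, 11, 12, 26, 27})
BLOCK = frozenset(range(256)) - ALLOW - GRAY


def is_plain_text(data: bytes) -> bool:
    """Plain text: at least one allow-listed byte and no block-listed byte."""
    seen = set(data)          # only presence matters, not frequency
    if seen & BLOCK:
        return False          # any block-listed byte means binary
    return bool(seen & ALLOW)  # empty or gray-only input is binary


print(is_plain_text(b"hello, world\n"))      # ordinary text
print(is_plain_text(b"\x1b[1mbold\x1b[0m"))  # ESC is tolerated
print(is_plain_text(b"hello\x00"))           # NUL is block-listed
print(is_plain_text(b"\x07\x1b"))            # gray-listed bytes only
print(is_plain_text(b""))                    # empty input
```

Because the decision depends only on which byte values occur, a single
pass that records the set of values seen is sufficient.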


Rationale
---------

The idea behind this algorithm relies on two observations.

The first observation is that, although the full range of 7-bit codes
[0..127] is properly specified by the ASCII standard, most control
characters in the range [0..31] are not used in practice.  The only
widely-used, almost universally-portable control codes are 9 (TAB),
10 (LF) and 13 (CR).  There are a few more control codes that are
recognized on a reduced range of platforms and text viewers/editors:
7 (BEL), 8 (BS), 11 (VT), 12 (FF), 26 (SUB) and 27 (ESC); but these
codes are rarely (if ever) used alone, without being accompanied by
some printable text.  Even the newer, portable text formats such as
XML avoid using control characters outside the list mentioned here.

The second observation is that most binary files tend to contain
control characters, especially 0 (NUL).  Even though the older text
detection schemes observe the presence of non-ASCII codes from the range
[128..255], the precision rarely has to suffer if this upper range is
labeled as textual, because the files that are genuinely binary tend to
contain both control characters and codes from the upper range.  On the
other hand, the upper range needs to be labeled as textual, because it
is used by virtually all ASCII extensions.  In particular, this range is
used for encoding non-Latin scripts.

Since there is no counting involved, other than simply observing the
presence or the absence of some byte values, the algorithm produces
consistent results, regardless of what alphabet encoding is being used.
(If counting were involved, it could be possible to obtain different
results on a text encoded, say, using ISO-8859-16 versus UTF-8.)
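
The point about counting can be made concrete with a short Python
sketch.  The sample string, the function names and the 4/5 threshold
used for the counting variant are assumptions chosen for illustration;
the sample is a Romanian phrase containing one non-ASCII letter.

```python
BLOCK = frozenset(range(0, 7)) | frozenset(range(14, 26)) | frozenset(range(28, 32))
GRAY = frozenset({7, 8, 11, 12, 26, 27})


def presence_based(data: bytes) -> bool:
    """Allow/gray/block rule: depends only on which byte values occur."""
    seen = set(data)
    # With no block-listed byte present, anything outside GRAY is allow-listed.
    return not (seen & BLOCK) and bool(seen - GRAY)


def counting_based(data: bytes) -> bool:
    """Counting rule: more than 4/5 of the bytes must lie in 7..127."""
    return 5 * sum(1 for b in data if 7 <= b <= 127) > 4 * len(data)


sample = "ora\u0219 mic"                 # hypothetical sample text
as_latin10 = sample.encode("iso8859_16")  # one byte per character
as_utf8 = sample.encode("utf-8")          # the non-ASCII letter takes two bytes

for name, data in [("ISO-8859-16", as_latin10), ("UTF-8", as_utf8)]:
    print(name, len(data), presence_based(data), counting_based(data))
```

In this sketch the presence-based rule gives the same answer for both
encodings, while the counting rule accepts the ISO-8859-16 bytes
(7 of 8 in range) and rejects the UTF-8 bytes (7 of 9 in range).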

There is an extra category of plain text files that are "polluted" with
one or more block-listed codes, either by mistake or by peculiar design
considerations.  In such cases, a scheme that tolerates a small fraction
of block-listed codes would provide an increased recall (i.e. more true
positives).  This, however, comes at the cost of reduced precision
overall, since false positives are more likely to appear in binary files
that contain large chunks of textual data.  Furthermore, "polluted"
plain text should be regarded as binary by general-purpose text
detection schemes, because general-purpose text processing algorithms
might not be applicable.  Under this premise, it is safe to say that our
detection method provides a near-100% recall.
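
The trade-off can be sketched with a hypothetical tolerant variant in
Python.  The function `is_plain_text_tolerant`, its
`max_block_fraction` parameter and the sample buffers are assumptions
introduced only for illustration; the method described in this article
corresponds to a tolerance of zero.

```python
ALLOW = frozenset({9, 10, 13}) | frozenset(range(32, 256))
BLOCK = frozenset(range(0, 7)) | frozenset(range(14, 26)) | frozenset(range(28, 32))


def is_plain_text_tolerant(data: bytes, max_block_fraction: float = 0.0) -> bool:
    """Hypothetical variant that tolerates a fraction of block-listed bytes."""
    if not data:
        return False
    block_count = sum(1 for b in data if b in BLOCK)
    has_allowed = any(b in ALLOW for b in data)
    return has_allowed and block_count <= max_block_fraction * len(data)


# "Polluted" text: a log-like buffer with one stray NUL byte.
polluted_text = b"log line\n" * 100 + b"\x00"
# Binary-looking data that carries a large chunk of text.
binary_with_text = b"\x01\x02" + b"text chunk " * 100

for name, data in [("polluted text", polluted_text),
                   ("binary with text", binary_with_text)]:
    strict = is_plain_text_tolerant(data)          # zero tolerance
    lenient = is_plain_text_tolerant(data, 0.01)   # tolerate up to 1%
    print(name, strict, lenient)
```

With a 1% tolerance the polluted text is recovered, but the binary
sample is accepted as well, which is the loss of precision described
above.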

Experiments have been run on many files coming from various platforms
and applications.  We tried plain text files, system logs, source code,
formatted office documents, compiled object code, etc.  The results
confirm the optimistic assumptions about the capabilities of this
algorithm.


--
Cosmin Truta
Last updated: 2006-May-28
