UNIVERSAL SERIALIZED CHANGESET DATA FORMAT SPECIFICATION FOR GENERAL SOFTWARE CONFIGURATION MANAGEMENT SYSTEM (Draft) May 2005 prepared for All Software Configuration Management Hackers by Tez Kamihira Table Of Contents Preface 1. Introduction 1.1. Motive 1.2. Fundamental Principles 1.3. Scope 1.4. Structure Of This Document 2. Terminology 2.1. Overview 2.2. Unidiff Output Sample 2.3. Definitions 3. Differential Unit 3.1. Layout 3.2. Extended Unidiff Unit 3.3. Binary Unit 4. Hunk 4.1. Standard Hunk 4.2. Binary Hunk 4.3. Property Hunk 4.4. ID Hunk 5. Differential Calculation 6. Applying Semantics 6.1. Notation 6.2. Algorithm 6.3. Error Recovery 7. Unit Header 7.1 Extended Marker 7.2 File Name Field Convention 7.3 Time Field Convention APPENDIX A: Some Examples APPENDIX B: Formal Syntax REFERENCES PREFACE This document tries to define a special file format so-called "Changeset" which has been used by several Software Configuration Management Systems (SCM) to describe the difference between arbitrary two directory trees. One of the most famous format along the lines is GNU arch's[1]. But unfortunately, this format is a directory structure which consists of many structured files and subdirectories, and clearly not suitable for handy transfer by e-mail nor similar text-based communication channels. In this document, mainly based on de-facto Unidiff format and an early research by GNU arch, I'll try to define another type of file format called "Serialized Changeset", under the conditions that (1) it doesn't depend on any specific SCM system and (2) as a single monolithic file format. I hope this document will be useful for anyone who is interested in inter-operability and easy data exchanges between any two different SCMs, and also it will take the first step to re-understand the key concept "Changeset" more consciously, more reflexionally, more uniformly, and more abstractly. Tez Kamihira (tez@kamihira.com) May 2005 1. Introduction 1.1. Motive In the world of freely available software, to express the total modification under the directory, we almost always invoke "diff" tool, make a well-known Unidiff format text, embed it in E-mail, and transfer to another person. At the receiving side, he/she invokes "patch" tool paired with "diff", applies the information to their target trees, and reproduce the proper result. Recently thanks to the many developers in this area, various value-added SCM systems have been made and we can exchange the difference of our data with others more correctly and more conveniently. Using SCM for data exchange, we can send not only file contents itself, but also additional data just like renamed file information, binary file's difference, the properties attached to the files, so on. Such benefits drive many people to make various "yet another SCM" to satisfy the case-by-case needs and actually they have given us a huge amount of freedom to select the most suitable SCM for the individual project. But on the other hand, this trend results in the serious data incompatibility among many SCMs. This document will try to define a new data format mainly based on the well-known Unidiff format to exchange any kind of difference occurred in one SCM system with another, even if the latter system is not the same as the former. I'll call the data format "Serialized Changeset" and is just noted as "SCS" shortly below. Roughly speaking, SCS is defined by the collection of top-level sub-structure named "Differential Unit (or simply Unit)" which then consists of sub level structure named "Hunk". "Hunk" is well-known word in Unidiff, but it'll be slightly extended for additional information. 1.2. Fundamental Principles The fundamental design concept is as following: Principle 1: Readability SCS must be easily readable by human being as much as it can, just like traditional Unidiff as well. Principle 2: Monolithity SCS must be represented by a single text-based monolithic file which is supposed to be exchanged by E-mail. Principle 3: Upper Compatibility SCS must be upper compatible with Unidiff and almost all the line semantics are easily guessed by every traditional Unidiff user. Principle 4: Neutrality SCS MUST NOT depend on any kind of specific SCM system. It should be completely neutral beyond such all systems. 1.3. Scope This specification was initially inspired by Directory-based Changeset defined by GNU arch revision control system[1]. But actually SCS concept isn't related to any kind of specific SCM directly. It is provided simply for describing the difference between arbitrary two directory trees. Adding another option to GNUdiff/GNUpatch and allowing it to accept SCS as well would be an interesting application, for example. 1.4. Structure Of This Document In this document, after reviewing the traditional Unidiff format and every term of it, SCS internal structure will be explained by top-down approach, that is: SCS -> Unit -> Hunk order. Then it handles actual calculation and applying algorithms. These are roughly corresponding to diff/patch operations in Unidiff world. In chapter 2, all terminologies used in this document will be properly defined. Starting from the well known Unidiff format, we will fix every word which has been vaguely and unconsiously used by traditional Unidiff, and introduce SCS specific terms as well. In chapter 3, the top-most SCS sub-structure called "Differential Unit (or simply Unit)" will be explained in detail. SCS is a collection of Unit which is very similar data structure with Unidiff one. SCS's Unit consists of two sub-types. The first one is called Extended Unidiff Unit which is some extended version of original Unidiff's. The second one is called Binary Unit which is newly introduced in SCS to describe the difference of general binary data. Every Unit is, again, a collection of finer sub-structure named Hunk. In chapter 4, all Hunk types will be explained. There exist totally four types now. Standard Hunk is just Unidiff Hunk itself. Rest of them are ID Hunk, Property Hunk, and Binary Hunk. Extended Hunks enable us to describe much more detail informations about the tree than simply using Unidiff. SCS can handle not only regular files, but also special files such as directories, symbolic links, binary files which are not calculate the difference easily by legacy Unidiff method. In chapter 5, specific differential rule will be explained to calculate for all file types uniformly. When applying SCS information to target tree, several SCS specific issues will arise, all of which we didn't meet in the legacy Unidiff world. In chapter 6, we will get all complex and subtle issues about SCS semantics straight and present specific algorithm for applying it to the target tree. 2. Terminology 2.1. Overview All terminologies used by SCS format will be presented in this chapter. Because SCS is a kind of extension of Unidiff, after a typical Unidiff output will be showed, each part of it will be defined explicitly. Extended term isn't found in traditional Unidiff sample, but is also defined along the lines. Unidiff doesn't seem to have any kind of formal specification except the source code itself. Notice that this chapter will be also devoted to make Unidiff's own concepts more clear as well as to introduce SCS specific ones. 2.2. Unidiff Output Sample Let "./a" and "./b" are two directory and there exist "foo.txt" and "bar.txt" under "./a". Moreover let any modifications to those files be put into "./b". On such premises, the output of "diff" program would be the following: 1: diff -ur a/bar.txt b/bar.txt 2: --- a/bar.txt 2005-05-23 10:58:17.583015024 +0900 3: +++ b/bar.txt 2005-05-23 10:59:12.976593920 +0900 4: @@ -1,5 +1,6 @@ 5: aaaaaaaaaa 6: +bbbbbbbbbb 7: cccccccccc 8: dddddddddd 9: eeeeeeeeee 10: -ffffffffff 11: \ No newline at end of file 12: +ffffffffff 13: diff -ur a/foo.txt b/foo.txt 14: --- a/foo.txt 2005-05-23 10:34:11.199898680 +0900 15: +++ b/foo.txt 2005-05-23 10:35:02.941032832 +0900 16: @@ -1,7 +1,7 @@ 17: It was many and many a year ago, 18: In a kingdom by the sea, 19: That a maiden there lived whom you may know 20: -By the name of ANABEL LEE; 21: +By the name of ANNABEL LEE; 22: And this maiden she lived with no other thought 23: Than to love and be loved by me. 24: 25: @@ -34,7 +34,7 @@ 26: And neither the angels in heaven above, 27: Nor the demons down under the sea, 28: Can ever dissever my soul from the soul 29: -Of the beautiful Anabel Lee. 30: +Of the beautiful Annabel Lee. 31: 32: For the moon never beams without bringing me dreams 33: Of the beautiful Annabel Lee; The line numbers appeared in the following sections correspond to the above lines. 2.3. Definitions Unidiff Format The legacy differential format generated by Unix's "diff" program such as GNU diff with -u option. There aren't any specific documents that refer to this format except their implementations. Serialized Changeset (SCS) All modifications considered as a single entity as a whole in a file, and it will be defined formally in the below documents. Also noted SCS shortly. SCS format is quite similar with the output of "diff -Nur". The only extended parts from Unidiff are basically a few special Hunks. Differential Unit (Unit) Such lines as 2-12, 14-33 are called Differential Unit or simply Unit. Every Differential Unit consists of a header and collection of Hunks defined below. It'll be noted shortly "Unit" as well. Differential Unit Header First and second lines of Differential Unit. In the above example, line 2, 3, 14, 15 are headers. Differential Unit Marker In the case of Unidiff, the beginning of Unit Header starts with "---" or "+++" substring and tells us which line is the beginning of actual Differential Unit. Those strings are called Unit Markers. There are other additional markers defined by SCS to describe each unit's nature in more detail. The basic property of each Differential Unit can be specified completely by the marker type. Standard Marker, Extended Marker The origial "---" and "+++" marker that appear in legacy Unidiff are called standard markers and others are called extended marker. Extended markers are newly introduced by SCS. In the above sample, there aren't any extended markers. They will be defined in the below document separately. Hunk, Hunk Header, Hunk Body Lines 4-12, 16-24, 25-33 are started by "@@"-string and we call them Hunks. This term has already been used by legacy Unidiff as well. The beginning line of the Hunk is called Hunk header and rest of them are called Hunk body. In the case of above example, it consists of two Differential Units. First one has only one Hunk. Second one has two. SCS defines a few additional Hunks. They can describe the tree information in more detail. See also later sections. Numeric values appeared in Hunk Header Hunk Header usually has four numeric values. They are called as pre-offset, pre-size, post-offset, and post-size. Junk Lines line 1 and 13 are called Junk Lines, and patch program doesn't actually need them. It's only for human. Line attribute in Hunk bodies Every line in Hunk body starts with ' ', '-', '+', or '\' character. They are called invariant line, deleted line, appended line, and EOLS(End of line status), respectively. Even if extended Hunks in SCS also start with these characters, the meanings would be slightly different from original ones appeared in legacy Unidiff. Standard Hunk, Extended Hunk The Hunk appeared in traditional Unidiff is called standard Hunk, and SCS specific Hunks are called extended Hunks. Extended Hunks are divided into three types, that is, ID Hunk, property Hunk, and binary Hunk. Above example has standard Hunk only. Property Hunk Some of the SCM systems that are widely used now can also manage meta-information or property information attached some files, as well as the actual contents of them. Property Hunk is provided to record such kind of information. The detail will be discussed below. There are no property Hunks in above example. ID Hunk Some of the system can track such event as file renaming. This mechanism assumes that some invariant true name exists behind the flexible file names. This universal name is called "inventory", "file ID", or "node ID" depending on the specific SCM. We call it "Item ID" in this document. ID Hunk is provided to record this invariant name and consists of one line Hunk header only. The detail will be explained at the other part of this document. There are no ID Hunks in the above example. Item Each managed entity by SCM is called Item. Most of the Items refer to the regular files. But sometimes they refer to special files as symbolic link or directory as well. The would be also managed by some SCMs. There are three types of Items, that is: directory Item: refers to some directory. symbolic Item: refers to some symbolic link. regular Item: refers to some regular file. That's all. This attribute is called Item type. Legacy Unidiff can only handle regular files, but SCS also handles any other Item types and their transitions. Item ID See ID Hunk. Item Type Transition Every Item could be changed to another type of Item. This transition is called Item Type Transition. It's very unnatural event to change the Item types each other, because it means, for example, a directory changes to a regular file, or a regular file changes to a symbolic link, etc,. Some SCM prohibits such changes, but others are permitted. Because we want to add pretty general abilities to SCS format not depend on any SCMs, Item type transition is broadly supported. Null Item Set aside normal three types of Items, we also consider the "non-existed Item state" virtually as well. It is called Null Item. Using this idea, we can distinguish whether the file simply truncated to size zero or actually removed. The same discussion is also hold between pure file creation and simple inflation from an empty file. Item Contents For every Item type, the contents of the Item is assumed, which is called Item contents. In the case of regular Item which refers to regular file, file contents has literal meaning, but it's defined for rest of the Item types as well. See the chapter of differential calculation in detail. The operation will be executed between two Item contents. Binary Difference When calculating the difference between two files, Unidiff would be failed by several reasons. In those cases, SCS MUST still record the difference by some kind of binary differential algorithm. It includes the special case that both original and modified files are simply recorded as is. The actual contents of the difference will be preserved in SCS by base64 encoded form. Binary difference will be discussed later. Reversibility of Difference The differential algorithm is either reversible or irreversible. Let arbitrary two files b1 and b2, and let the difference d. Under the condition, if b2 can be restored from b1 and d as well as b1 can be restored from b2 and d, we say the algorithm is reversible. If the algorithm is not reversible, it's called irreversible. Unidiff difference is always reversible, for example. Binary Difference Type Because there are many binary differential algorithms, the name of the algorithm will be embedded into the Hunk header. This algorithm name is called the type of the binary difference. SCS generator A program that generates correct SCS format. SCS scanner A program that reads both SCS and traditional Unidiff format. 3. Differential Unit (Unit) 3.1. Layout SCS is a sequence of several Differential Unit (or simply Unit), in a line. Junk lines between two Units are permitted, but it doesn't allowed the lines that have only zero or more blank characters. This is because unlike Unidiff format, SCS has an individable meaning as a whole. There are two types of Units. (Extended) Unidiff Unit and Binary Unit. Former may have some extended Hunks discussed below, but basically it's quite similar form with traditional Unidiff. Latter was introduced for some Item pairs by which we can not calculate Unidiff difference. Even in such cases, Binary Unit can still represent those difference. Because Binary Unit breaks the principle of readability and it has normally too many lines, they MUST be put after the all Unidiff Units. As a result, SCS is defined by a sequence of zero or more Unidiff Units and zero or more Binary Units in this order. When SCS is actually applied to target trees, the Unit applying order will be critical compared to legacy Unidiff cases. Because SCS can describe file renaming/creation/deletion as well, it can not always form valid operation to apply each Unit from the beginning to the end. The physical Unit layout order in SCS is _completely irrelevant_ to their proper applying order. It's decided only by the logical relationship among each Unit and absolutely independent from their physical Unit order. The algorithm related to Unit applying order will be discussed later. 3.2. Extended Unidiff Unit Extended Unidiff Unit consists of: (1) Extended Unit Header (2) ID Hunk (3) Sequence of Standard Hunks (4) Property Hunk in this order. (2) and (4) are optional. The last character of the Extended Markers are always "-" or "+". Here's an example of Extended Unidiff Unit: ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900 ++++ ./bar.txt 2005-05-23 10:59:12.976593920 +0900 @@ id: i_abc @@ @@ -1,3 +1,4 @@ aaaaaaaaaa +bbbbbbbbbb cccccccccc dddddddddd @@ properties: @@ -zzz: xxx +zzz: yyy We MUST also accept the traditional --- +++ style marker. It MUST be interpreted as a short cut for ---- ++++ extended marker. Almost all the Units must have this type when used by software development, because they basically manage text files only. Today's wide spreaded 'patch' tools SHOULD discard the Extended Unidiff Unit regions with warnings. They SHOULD NOT be choked by those areas. 3.3. Binary Unit Binary Unit consists of: (1) Extended Unit Header (2) ID Hunk (3) Binary Hunk (4) Property Hunk in this order. (2) and (4) are optional. (3) MUST appears exactly once. The last character of the Extended Markers are always "x". Here's an example of Binary Unit: ---x ./foo.txt 2005-05-23 10:58:17.583015024 +0900 +++x ./foo.txt 2005-05-23 10:59:12.976593920 +0900 @@ bin: plain, -1,1 +1,1 @@ -AAECAwQF +AAECAwQFBg== Today's wide spreaded 'patch' tools SHOULD discard the Binary Unit regions with warnings. They SHOULD NOT be choked by those areas. 4. Hunk This chapter is devoted to the explanation of each Hunk type. Standard Hunk appeared in 4.1. has already been used in traditional Unidiff format as well and that part is only for reviewing purpose. Rest of the sections are pure extensions by SCS. Today's wide spreaded 'patch' tools SHOULD discard all of the Hunk's regions with warnings except Standard ones. They SHOULD NOT be choked by those areas. 4.1. Standard Hunk Standard Hunk Header usually has the following format: @@ -A,B +C,D @@ A, B, C, D are called pre-offset, pre-size, post-offset, and post-size respectively. The meaning of this Hunk header is as follows. "The original part from line number A and has the length B was replaced by modified part from line number C and has the length D. The detail is presented by the following Hunk body.", where A and C are 1-origin line number. Not zero. A, B, C, D are usually all natural numbers, but there are various edge cases according to their range and relations. (1) B or D can be omitted when it equals to 1. (2) When there are no lines in original or modified text, B or D will be zero. (3) When insertion or deletion occurrs at the beginning of the text, A or C will be zero. Moreover, the following relations are hold. (4) B is always sum of deleted lines and invariant lines. D is always sum of appended lines and invariant lines. (5) If B is zero, this Hunk doesn't have any invariant lines nor deleted lines at all. In such cases, A points to just before the insertion point. If this position is the beginning of the file, A must be zero. (6) If D is zero, this Hunk doesn't have any invariant lines nor appending lines at all. In such cases, C points just before the deleting point. If this position is the beginning of the file, C must be zero. (7) If A is zero, then B is also zero. (8) If C is zero, then D is also zero. SCS scanner SHOULD read input sequence correctly even if (1) case holds. But SCS generator MUST always write all four numbers. 4.2. Binary Hunk Binary differential algorithms are divided into two types. Reversible and Irreversible. For example, two of the well known command line binary differential programs are "xdelta"[5] and "bsdiff/bspatch"[6]. Both seem irreversible. In the below explanation, we assume the following. There is a file named "binary.bin". The original content is referred as "binary.bin(1)". The modified content is referred as "binary.bin(2)". The binary difference "binary.bin(1)" and "binary.bin(2)" is "forward.bin", and its reverse difference is "reverse.bin". The base64 encoded "forward.bin", "reverse.bin" are "base64-forward.txt", "base64-reverse.txt", respectively. Using some irreversible binary differential algorithm, say "yoyodyne method", we can get the reversible binary differential information by the following way: Calculate both "base64-forward.txt" and "base64-reverse.txt", put "base64-forward.txt" lines with leading '-' characters at the beginning part of the Hunk Body, then put "base64-reverse.txt" lines with leading '+' characters at the next part of the Hunk Body. Insert binary format name and quartet number information in the Hunk Header, where pre-offset is 1, pre-size is line numbers of "base64-forward.txt", post-offset is 1, post-size is line numbers of "base64-reverse.txt", respectively. Here's an image: ---x ./binary.bin 2005-05-10 13:05:26.000000000 +0900 +++x ./binary.bin 2005-05-10 13:06:10.000000000 +0900 @@ bin: yoyodyne, -1,3 +1,3 @@ - - ... contents of base64-forward.txt ... - + + ... contents of base64-reverse.txt ... + The semantics of '-' or '+' is slightly different from Unidiff one, but we decided to use them to avoid populating special characters. Because base64 encoded text couldn't be read by everyone anyway, nobody would be confused about such lines. On the other hand, when there exists a reversible binary differential algorithm, say "strangelove method", all we have to do is calculating "base64-forward.txt" only. In the Hunk Header, pre-offset is 1, pre-size is the line numbers of "base64-forward.txt", post-offset is 0, post-size is 0, respectively: ---x ./binary.bin 2005-05-10 13:05:26.000000000 +0900 +++x ./binary.bin 2005-05-10 13:06:10.000000000 +0900 @@ bin: strangelove, -1,3 +0,0 @@ - - ... contents of base64-forward.txt ... - In the same spirit as irreversible case, we simply add '-' character to the head of each base64 encoded line. By checking both post-offset and post-size are zero, we can see whether this algorithm is reversible or not. The pre-defined binary formats are as follows: plain (irreversible): Not calculating actual difference, the original and modified version of the file are simply base64 encoded and embedded to the Hunk. The original lines are added by '-' at the head of each line, and modified lines are added by '+'. gzip (irreversible): Same as plain, except that compression must be occurred before base64 encoding. xdelta (irreversible): Using xdelta algorithm for calculation. bsdiff (irreversible): Using bsdiff/bspatch algorithm for calculation. 4.3. Property Hunk Property Hunk is provided for recording property information or meta-information in each file controlled by some modern SCMs. You can add it to not only Unidiff Unit, but also to Binary Unit. property Hunk MUST be the last Hunk in the Unit, if exists. Some property keys, just like file permission, are reserved. The list of reserved keys are as follows: permission: file permission information. The body of property Hunk MUST be sorted by their key order. This rule MUST be applied in both deleted lines part and appended lines part. We could consider the file modification time as a kind of property, but this values are changed almost every time. If it would be put into property Hunk, almost every time SCS has property Hunk in each Unit. To avoid this, modification time MUST be put in the tail of the Unit header. The binary encoding rule accepts \\, \n, \r, \t, \ooo, \xhh characters. 4.4. ID Hunk Item ID would be embedded in a special Hunk which MUST be always at the beginning of the all Hunks. This Hunk is called ID Hunk. The format of ID Hunk is as follows: If this file is tagged by GNU arch's tagline method: @@ id: i_cdbdd634-3f67-438c-97d3-a63a0699d6b9 @@ or by Subversion's node ID method: @@ id: [FIXME:] @@ ID Hunk is optional and doesn't have to always exits, but if exists, the patch operation MUST be done by the ID information and MUST NOT by the file name in the Unit header. In Subversion case, so-called node ID would be embedded into the ID Hunk. [FIXME: Am I wrong ?] ID Hunk has this line only. It doesn't have any Hunk body at all. The permitted character in the Item ID is out of scope of this document. But [-0-9a-zA-Z_]+ would be reasonable by extended regular expression. 5. Differential Calculation Given two Items which have any Item types, we can always calculate their difference. SCS allows Item type transitions even between different Item types. We have to define general differential rule even between hetero-Item types. Differential Calculation must be done on the two file contents which are properly defined for each Item type respectively. They are as follows: Item Type Contents ------------------ ------------------------------------------- directory Item or It's considered as just zero length null Item empty byte sequence. ------------------ ------------------------------------------- symbolic Item Link target path name + LF. ------------------ ------------------------------------------- regular Item Normal meaning. Data in the file themselves. ------------------ ------------------------------------------- The actual differential algorithm is as follows: (1) Check Unidiff calculation is available or not. (2) If possible, that output is just the answer. (3) If impossible, the SCS diff program uses an algorithm of its choice to generate one of the allowed binary format specified in section 4.2: Binary Hunk. We do not specify how to determine whether a Unidiff is possible. There is no simple answer and this decision is left up to the implementation. Notice that the binary differential calculation in (3) always succeeds. According to the specific Unidiff implementation, or processing system, the check result of (1) would be different from each other. So the difference SHOULD be Unidiff algorithm as much as possible. When an Unidiff calculation system is fixed, the following either result will be possible for arbitrary two Items. (a) It can be represented by both Unidiff format and binary format. (b) It can be represented only by binary format. 6. Applying Semantics Each Unit in legacy Unidiff can't handle rename information, because it has only closed information related to _single file_. It means that the applying order of Unidiff Unit doesn't affect to the result of the transformation. All of the different ordered transformation result in the same tree. But this property will not hold for SCS which can also describe even file renaming, copying, creation, and deletion as well. Take the following SCS for example: (1) "foo.txt" was removed, and (2) "bar.txt" was renamed to "foo.txt". Regardless to the physical order of (1) and (2) in the SCS, those MUST be applied from (1) to (2) order. If you apply them from (2) to (1), it's pretty obvious that both "foo.txt" and "bar.txt" will be lost out of the target. This chapter is devoted to the Unit applying order in SCS. 6.1. Notation When a Unit was changed from file f to g and it has ID Hunk valued I, we note it as (f, g, I). We will always use f, g, h, ... for file names and I, J, K ... for Item IDs. f is actually such names as "./foo.txt" and I would be some UUID-flavoured strings as "i_cdbdd634-3f67-438c-97d3-a63a0699d6b9". Each variable has index if necessary. Index will be noted by [] just like f[n]. Because ID Hunk is optional, Unit wouldn't have any Item ID. In such cases, SCS assumes any uniquely assigned Item ID that will never conflict with other Item IDs. In the below argument, every Unit has properly assigned Item ID introduced by the above convention. 6.2. Algorithm Basically the applying algorithm in this section uses a slightly modified version of GNU arch's "apply_changeset" algorithm. Using this algorithm, we can always avoid any kind of edge cases and modify the target tree properly. Let SCS Units are U[1] = (f[1], g[1], I[1]) U[2] = (f[2], g[2], I[2]) ... U[n] = (f[n], g[n], I[n]) then SCS MUST be applied by the following steps. 6.2.1. remove deleted Items from the tree. Compute the subset of U[1..n] which element's g[i] is null Item. Remove all the files in the subset from the tree. This process MUST be concerned the directory structure, that is, ordered from the deeper to the shallower. 6.2.2. Set aside renamed/copied Items from the tree, once. Compute the subset of U[1..n] which element's f[i] and g[i] differ. Set aside such files (or copy of files) out of the tree for a moment. 6.2.3. Install Items set asided in 6.2.2. When each item is installed in target tree, new name must be used. This process must consider the directory structure. You should execute from shallower ones to deeper ones. 6.2.4. Add appended Items to target tree Compute the subset of U[1..n] which element's f[i] is null Item. Install such files to the tree. This process must, again, concern the directory structure, that is, ordered from the shallower to the deeper. 6.2.5. actual patching Depending on the Unit type, actual patching process is invoked. normal patch applying for extended Unidiff Unit, binary patch applying for Binary Unit, respectively. this process must be done by the Item ID. not by the file name. How to handle In-exact patching is out of scope of this document. 7. Unit Header 7.1. Extended Marker Extended marker consists of four characters and indicates what kind of Item Type Transition occurred in it and whether the Unit is binary or not. Notice that legacy Unidiff marker had only three characters. The symbolic rules of Extended Marker are as follows: char 1. char 2. char 3. char 4. 1st line '-' '-' [*1] [*3] 2nd line '+' '+' [*2] [*4] *1 directory Item ... 'D' symbolic Item ... 'S' regular Item ... '-' null Item ... '.' *2 directory Item ... 'D' symbolic Item ... 'S' regular Item ... '+' null Item ... '.' copy Item ... 'C' *3 when extended Unidiff Unit ... '-' when Binary Unit ... 'x' *4 when extended Unidiff Unit ... '+' when Binary Unit ... 'x' 7.2. File Name Field Convention The first path component of a file name must be '.'. Special file named just one '.' is permitted and it represents the tree's root directory. The files just under '.' are referred as "./foo.txt" or so, and the "bar.txt" under the sub-directory "sub" are specified by "./sub/bar.txt" or so. Even if the file name refers to some directory, The file name MUST NOT have a trailing '/' character. Every file name doesn't have any redundant subdirectory part appeared in "diff -ur" output. When null marker is appeared in Unit header, the corresponding file name and the time field MUST be blanked. [FIXME: File name MUST NOT have any space characters.] [FIXME: non-ascii character are out of scope of this document.] 7.3. Time Field Convention Time fields SHOULD be indented to match the both start columns of first and second header lines. It increases the readability pretty well. APPENDIX A. Some Examples Here's some examples of Differential Unit: a.) Regular file "./foo.txt" was modified. It's not under Item ID control. ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900 ++++ ./foo.txt 2005-05-23 10:59:12.976593920 +0900 @@ -1,4 +1,5 @@ aaaaaaaaaa +bbbbbbbbbb cccccccccc dddddddddd eeeeeeeeee b.) All of the lines in "./foo.txt" are deleted. not under Item ID control. "./foo.txt" still remains as an empty file. ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900 ++++ ./foo.txt 2005-05-23 10:59:12.976593920 +0900 @@ -1,4 +0,0 @@ -aaaaaaaaaa -bbbbbbbbbb -cccccccccc -dddddddddd c.) Same as b.), but "./foo.txt" was actually removed. ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900 ++.+ @@ -1,4 +0,0 @@ -aaaaaaaaaa -bbbbbbbbbb -cccccccccc -dddddddddd d.) Regular file "./foo.txt" was renamed to bar.txt. The Item is managed by Item ID "i_abc". ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900 ++++ ./bar.txt 2005-05-23 10:59:12.976593920 +0900 @@ id: i_abc @@ e.) Same as d.), but the contents also modified. ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900 ++++ ./bar.txt 2005-05-23 10:59:12.976593920 +0900 @@ id: i_abc @@ @@ -1,3 +1,4 @@ aaaaaaaaaa +bbbbbbbbbb cccccccccc dddddddddd f.) Same as e.), but the property value of "zzz" is also modified from "xxx" to "yyy". ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900 ++++ ./bar.txt 2005-05-23 10:59:12.976593920 +0900 @@ id: i_abc @@ @@ -1,3 +1,4 @@ aaaaaaaaaa +bbbbbbbbbb cccccccccc dddddddddd @@ properties: @@ -zzz: xxx +zzz: yyy g.) Regular file "./foo.txt" was copied to bar.txt. The Item is managed by Item ID "i_abc". ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900 ++C+ ./bar.txt 2005-05-23 10:59:12.976593920 +0900 @@ id: i_abc @@ h.) Same as d.), but the contents also modified. ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900 ++C+ ./bar.txt 2005-05-23 10:59:12.976593920 +0900 @@ id: i_abc @@ @@ -1,3 +1,4 @@ aaaaaaaaaa +bbbbbbbbbb cccccccccc dddddddddd i.) Same as e.), but the property value of "zzz" is also modified from "xxx" to "yyy". ---- ./foo.txt 2005-05-23 10:58:17.583015024 +0900 ++++ ./bar.txt 2005-05-23 10:59:12.976593920 +0900 @@ id: i_abc @@ @@ -1,3 +1,4 @@ aaaaaaaaaa +bbbbbbbbbb cccccccccc dddddddddd @@ properties: @@ -zzz: xxx +zzz: yyy j.) Append 1-byte "\x06" was appended at the bottom of the binary file "./foo.txt" which has had the 6 byte values "\x00\x01\x02\x03\x04\x05". The name was unchanged and used "plain" binary difference type. ---x ./foo.txt 2005-05-23 10:58:17.583015024 +0900 +++x ./foo.txt 2005-05-23 10:59:12.976593920 +0900 @@ bin: plain, -1,1 +1,1 @@ -AAECAwQF +AAECAwQFBg== k.) Create a new symbolic link named "./foo.txt" which refers to "/some/directory/foo.txt". --.- ++S+ ./foo.txt 2005-05-23 10:59:12.976593920 +0900 @@ -0,0 +1,1 @@ +/some/directory/foo.txt APPENDIX B. Formal Syntax SCS := [extended Unidiff Unit]* [Binary Unit]* extended Unidiff Unit := Unidiff Unit header-1 Unidiff Unit header-2 [ID Hunk] [Unidiff Hunk]* [property Hunk] Binary Unit := Binary Unit header-1 Binary Unit header-2 [ID Hunk] Binary Hunk [property Hunk] Unidiff Unit header-1 := '--' [-.DS] '- ' header-info '\n' Unidiff Unit header-2 := '++' [+.DSC] '+ ' header-info '\n' Binary Unit header-1 := '--' [-.DS] 'x ' header-info '\n' Binary Unit header-2 := '++' [+.DSC] 'x ' header-info '\n' header-info := file-name-description ' ' time-description '\n' ID Hunk := '@@ id: ' [0-9a-zA-Z_]+ ' @@' '\n' binary Hunk := '@@ bin: ' binary-algorithm-name ', ' standard-Hunk-quartet ' @@' '\n' binary Hunk body binary-algorithm-name := 'plain' | 'gzip' | 'yoyodyne' | ... binary-Hunk-body :=[lines started by '+' and reverse binary diff data encoded by base64] standard-Hunk-quartet := '-' number ',' number ' +' number ',' number file-name-description := time-description := property-Hunk := '@@ properties: @@' '\n' property-Hunk-body [FIXME:] REFERENCES [1] GNU arch http://www.gnu.org/software/gnu-arch/ [2] Lord, T., "arch Meets hello-world - A Tutorial Introduction to The arch Revision Control System" http://www.gnu.org/software/gnu-arch/tutorial/arch.html 2001, 2002, 2003, 2004. [3] GNU diff/patch http://ftp.gnu.org/pub/gnu/patch/patch-2.5.4.tar.gz http://ftp.gnu.org/pub/gnu/diffutils/diffutils-2.8.1.tar.gz [4] Subversion http://subversion.tigris.org/ [5] xdelta http://sourceforge.net/projects/xdelta/ [6] bsdiff/bspatch http://www.daemonology.net/bsdiff/