TechDoc: Using comma-separated values instead of JSON or XML for efficient data transfer

Posted on May 24, 2011



Remote Objects, Sockets and Data

When you send data from one Remote Object to another over Sockets, you want this to happen fast and without too much waste.

I started out using JSON for the HotForestGreen framework, but came to dislike it for the overhead it creates. There is too much redundancy in the way it reflects objects and too much overhead in parsing (more about the redundancy below).

Instead I started to look for a lean and mean way to serialize and de-serialize objects. You will find the primary considerations and goals in this post below.

Bigger picture

HotForestGreen is a framework written for Smart Spaces and for anything else where Remote Objects come in handy. Remote Objects are basically “copies” of objects created elsewhere and “teleported” to another machine or another application.

These Remote Objects can be:

  1. Records in a database
  2. Physical objects – Like lighting, doors, windows, sensors, motor driven objects and so on
  3. (Virtual) Objects in an application – Like web browsers, virtual mice and virtual keyboards, data objects representing the state of an object and so on

When the Remote Object (including the original) gets updated, all instances of that Remote Object are “instantly” updated as well. Regardless of where they are.

To do that, you send data over the network: data that represents that Remote Object. When updates to your Remote Objects are sparse, it does not matter much that your methods contain more fat than you need. The moment these updates come many times per second, in a constant stream, you want the best and most efficient chain of solutions you can find. This article describes:

  1. Why XML, JSON and CSV suck for data transfer – even though they are standards
  2. The serialization/deserialization basis for HotForestGreen – What shapes the chain of solutions used to serialize and de-serialize data in HotForestGreen. Example code is included, including the generic code I use to parse “CSV” to any type of object in the application.

Starting point: Finding a solid data format that performs

There are several goals in finding a solid data format:

  1. Compactness – As we might be in situations where a lot of data is sent over our connections, we want to reduce the overhead. All fat has to be stripped from our data-packages.
  2. Easy to parse – As parsing is one of the “weak spots” in performance, we want to reduce the overhead in parsing. The fewer steps we have to take to serialize objects to clear text data and parse that clear text data back to objects, the better.
  3. Performance – When serializing, de-serializing and sending data we want that to be fast. We want short transfer times, short waiting times, low strain on the processor. If we can do things in 3 steps, that is better than doing it in 5 or 25 steps. Less is more.
  4. Hard to break – We want the data-structure to be solid. Whatever we send inside our packages should not be able to break the structure in which we send it.
  5. Easy to use – In cases of manually produced data – like translations for buttons and content blocks in a multi-lingual site – we want the data-structure to be easy to use. People should be able to create their own data using a simple and common tool they already know.

Translations and Multi-lingual sites

One of the biggest pains in web sites for European campaigns is the requirement to support multiple languages. In most cases the text is translated while the site is being built, and the translators use software and data-structures that are not compatible with what is used to produce and fill the site.
Building this solution, intentionally only for internal use by the Framework, I realized that it also solves this pain.
A possible solution you can now use:
  1. Excel sheet per language – For each language you create a separate Excel sheet.
  2. Content per Excel sheet – Each Excel sheet contains:
    1. Variable names – In the first row of the Excel sheet, needed to map the values against your objects
    2. Content – The basic label and basic text to present in an element
    3. ID – The ID of the element against which the text should be mapped (in the first column)
    4. Empty cell for translation – An empty third cell to put the translation in
  3. Saved as Tab Separated Values – To be used by the Framework (a sample follows below)
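
For illustration, such a sheet saved as tab-separated text could look like this before translation (the IDs and variable names are hypothetical, <Tab> written out for readability):

id<Tab>label<Tab>translation
btnSend<Tab>Send<Tab>
txtIntro<Tab>Welcome to our site<Tab>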

How content is mapped and handled in the framework

  1. Persistent Objects – All Remote Objects in the framework are persistent objects.
  2. Object retrieval – When data comes in for a specific object, that object is retrieved via its Object ID.
  3. Injection of Values – The values received are then injected in the Object
  4. Dispatch of events – When the Object already exists, the Framework will dispatch an “Object Updated” event
  5. Handling of events – Your code can use that event to reflect those changes in your application. For instance: when the user changes the language, all labels and buttons are re-labelled too (a sketch follows below).
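
As an illustration, handling such an update event in your application could look something like the sketch below; the names (onObjectUpdated, LabelSet, submitButton) are hypothetical – only the EventParameters type appears elsewhere in this post:

// Hypothetical handler for the "Object Updated" event dispatched by the Framework
public void onObjectUpdated(object updatedObject, EventParameters parm)
{
    // Hypothetical class holding the labels for the current language
    LabelSet labels = updatedObject as LabelSet;
    if (labels != null)
    {
        // Re-label the UI with the freshly injected values
        submitButton.Text = labels.submitLabel;
    }
}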

Complex data structures: JSON and XML

For Complex Data Structures we have JSON and XML to serialize Objects into a transferable structure.

JSON and XML allow us to serialize Objects and Object structures and return a clear-text variant of the object and the data it contains.

The issue with JSON and XML: a lot of fat

JSON shares the same issue as XML: overhead, fat. Data – in the shape of variable names – that increases the weight of our packages.

Each object is serialized including the names of its variables and the quotes that delimit the scope of the content. When we only have one object, this is not a big issue, as it allows us to map the values to our object when we de-serialize.

It is an issue when we start to send lists.

For instance, take this JSON list of objects:

[
   {
      "firstname":"Peter",
      "lastname":"Kaptein",
      "someNumericValue":"1234"
   },
   {
      "firstname":"John",
      "lastname":"Doe",
      "someNumericValue":"5678"
   }
   // and so on
]

This JSON structure uses more bytes of data to describe the “object” than to carry the values of its variables: and the values are the stuff we really care about.

In our best case scenario, 50% of the raw data we send from “A” to “B” are values. The other 50% is overhead. In most cases of simple objects, roughly 70% to 80% of JSON is overhead.
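
To make that concrete: compacted onto a single line, the first object above – {"firstname":"Peter","lastname":"Kaptein","someNumericValue":"1234"} – takes 69 characters, of which only 16 (“Peter”, “Kaptein” and “1234”) are actual values. That is roughly 77% overhead spent on structure and variable names.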

If we use lists we can cut out a lot of redundancy. Hence: a CSV type approach.

Simple data structures: CSV

CSV stands for “Comma Separated Values”. There are several variants of how CSV is exported, and not all are consistent in what you find. But here is a representation of the above data structure:

"firstname";"lastname";"someNumericValue"
"Peter";"Kaptein";"1234"
"John";"Doe";"5678"

We cut out all redundancy and only keep the data that matters. The amount of raw data we need to send this “clear text object” is at least cut in half. Parsing this would be relatively simple, were it not that the output from Excel – one of the possible sources of CSV data – does not follow one simple rule.

Parsing objects to “CSV”

We use reflection and some tricks based on reflection to create the “CSV” Text Object, which can contain one or more objects of a specific type.
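
The serializer itself is not listed in this post, but a minimal sketch of such a reflection-based approach could look like this (assuming the ITEM_SEPARATOR and LINE_SEPARATOR constants used by the parser further below, and the System.Linq, System.Reflection and System.Text namespaces):

public static string serialize<T>(List<T> items)
{
    // Reflect the public fields once for the whole list
    FieldInfo[] fields = typeof(T).GetFields();
    StringBuilder sb = new StringBuilder();

    // First row: the field names, used for mapping on the receiving side
    sb.Append(string.Join(ITEM_SEPARATOR, fields.Select(f => f.Name).ToArray()));

    // Each next row: the field values of one object, in header order
    foreach (T item in items)
    {
        sb.Append(LINE_SEPARATOR);
        sb.Append(string.Join(ITEM_SEPARATOR,
            fields.Select(f => Convert.ToString(f.GetValue(item))).ToArray()));
    }
    return sb.ToString();
}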

Limitations (for now)

CSV in general does not cover complex/nested objects the way JSON and XML do. For instance:

  1. Variables containing Lists and Arrays – The objects inside the variable can be anything and any number
  2. Object references – An object referred to from within another object contains specific data and can be a specific object in the total structure: shared over multiple other objects

In general, referring to other objects will require an additional structure. This might be solved by inserting references to objects as a comma-separated list within the variable, to be processed in a specific way.

It is a limitation that can be solved when time is taken.

Writing about it might even lead to a quick solution.

Why we do not send the Object Class Definition

To keep true to the CSV and JSON basis, we keep the Object Class Definition out of our “CSV” Text Object.

This information is included in our Message, which contains our “CSV” Text Object.

When our Base Framework receives a message from another Client, the Message header can contain a reference to the Object type the data should be parsed against. In most cases, however, we simply apply a specific Observer to all data coming in, and when a message comes in for that specific Observer, we parse the data in the Class from which that Observer was created.

When we want to load a “CSV” file

When we want to load the “CSV” file, we do that from the Class into which the “CSV” will be parsed. For instance:

MyProductClass.load(
   "mydata/myTabSeparatedProductList.txt",
   myHandlerConverter.DATATYPE_TAB_SEPARATED,
   DataLoader.TYPE_FILE);

And that is it for loading.

The Class will load and (attempt to) parse the text file to the objects represented by and inside MyProductClass.

Once done, the onDataCallBack method will be called, receiving the parsed object list.

The Handler looks something like this:

public void myHandler(object myRawList, EventParameters parm)
{
     // Cast raw object to List of type
     List<BaseRemoteObject> myList = myRawList as List<BaseRemoteObject>;
     // Do something with the list
}

Variations in and limitations of CSV

CSV varies in how it can be output.

Excel follows a simple rule: when the content of a cell contains a value that is used as a delimiter, quotes are put around it.

Excel output can thus look like this (brackets added by me):

some value I inserted[;"]This one contains a semi-colon; and has to be escaped["]
1234[;"]The previous was a number. This one uses ""quotes""[";"]the end["]

The way data is escaped in an Excel sheet is quite foolproof, but it uses a set of different strategies to avoid the structure being broken by specific combinations of characters.

If I add ";" in my cell, like this: "my hack in this cell to break CSV ";" is based on faking the ; delimiter", the "; could be interpreted as a delimiter. To avoid that, Excel doubles the quotes (escaping " to "") and wraps the whole cell in quotes, separating the real delimiters from the fake ones.

CSV is not limited to the semi-colon. A comma can occur as well, depending on your locale (country/region/type of decimal separator). The problem with a comma is that it is used more frequently and thus increases the probability of breaking the integrity of the CSV structure.

So values can be separated like this too:

my first value,"my next",1234,"some more content"

Normalization or: why CSV sucks

When you want to parse CSV, you prefer the rules to be simple, with only one way to create a CSV file.

Unfortunately this is not the case. Here is an overview of all variations:

  1. <CR><LF> – To indicate the next line/row on a PC
  2. <CR> – To indicate the next line/row on a Mac
  3. , or ", or ," or "," or ; or "; or ;" or ";" – To indicate a delimiter between two cells / fields of data

Workarounds

The only two solutions to work around this are:
  1. Include all variations when parsing the CSV data – By using an extensive list of possible variations on the delimiter or – better – some Regular Expression magic (see the two posts listed under “Regular expressions” below).
  2. Drop CSV but keep the principles – We use two alternatives which are more solid, the first of which can be read into and exported directly from Excel:
    1. Tab Separated Values – For human use. We use tabs over CSV to separate values for the following reasons:
      1. A tab is hardly used in text
      2. It cannot be confused in the same way as ";" in CSV
      3. The output from Excel is consistent, which is not the case in the mess that is called CSV
    2. Secure Comma Separated Values – This structure is mainly for internal use by the Framework. We replaced the breakable separators from CSV with ASCII codes which indicate the same thing but are far less likely to be inserted by the user at some point.

Regular expressions

Regular expressions are mathematical string/pattern magic conceived in a parallel coders’ universe. Honestly said: I do not understand them well enough to know what is going on.

I will repeat the posts I found that led to the regular expression doing the decoding in HotForestGreen:

  1. Basic issues on CSV and regular expressions – This article explains the issues discussed in this post from the RegEx point of view
  2. A regular expression to release you from your pain – This post claims to offer a full solution for the inconsistencies in CSV
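
I will not pretend to explain it, but as an illustration of the kind of magic involved, a commonly cited pattern (not necessarily the one from the posts above) splits a line on commas only when they fall outside double quotes:

// Requires System.Text.RegularExpressions.
// Splits on a comma only when an even number of quotes follows it,
// i.e. when the comma is not inside a quoted cell.
string[] cells = Regex.Split(line, ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

Even then, the surrounding quotes and the doubled ""quotes"" still have to be stripped afterwards, which is part of why a full solution takes a whole post.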

CSV – for compatibility’s sake

For maximum compatibility, CSV is included as an option for the import of hand-made object definitions.

The code to convert the CSV uses the best method I could find online, but it is not guaranteed to be unbreakable, even when the CSV is correctly formatted.

It might even be slower, as more options have to be checked for each delimiter we have.

Tab Separated Values – for simplicity and speed

Tab Separated Values have two delimiters or separation codes:

  1. <Tab> – Indicates the next item/cell in our row
  2. <Enter> – Indicates the next line/row in our data set.

For content entered and modified in Excel, this is the most reliable format to export content to, for use in the Framework.

It will break on <Enter>s added inside a cell, as it cannot distinguish a humanly added <Enter> from the ones added by Excel.

Secure Comma Separated Values – for internal solidness

To separate values according to some kind of standard, we use ASCII codes. ASCII provides a number of codes in the range 0 to 31 for different uses, including defining the end of a file (EOF), moving the print head of a mechanical printer back to the beginning of the line (CR or “Carriage Return”) and to the next line (LF or “Line Feed”). Here is what we do within the SCSV structure in the Framework:

  1. <chr 31> – To separate values in a row we use the “Unit Separator” code.
  2. <chr 30> – For each next line we use the “Record Separator” code.
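
In C#, these separators could be defined like this (a sketch; the names match the parsing code later in this post):

const string ITEM_SEPARATOR = "\u001F"; // <chr 31>, "Unit Separator": next value in a row
const string LINE_SEPARATOR = "\u001E"; // <chr 30>, "Record Separator": next row
// For Tab Separated Values the same parsing code works with:
// const string ITEM_SEPARATOR = "\t"; // <Tab>
// const string LINE_SEPARATOR = "\n"; // <LF>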

Limitations

Secure Comma Separated Values and Tab Separated Values work in 99% of all cases, because the probability is very low that the patterns used to separate elements also occur in the content you package within them.

This is briefly why they will work in most cases:

  1. Excel allows for limited input – Within Excel <Tab> moves you to the next cell. <Enter> moves you to the next row. So separating values on <Tab> and <Enter> makes complete sense.
  2. Any other source of user-generated data might be using <Enter>, but not <chr 30> or <chr 31> – When you store and transfer data from a Rich Text Box, the game changes. <Tab> can mean “put the next part of the text at the first available tab stop from here” and <Enter> can mean: “start a new line from here”.

Tab Separated Values will break when the content contains a <Tab> (=next item) or an <Enter> (= next row).

Secure Comma Separated Values will break when the content contains binary data that contains either <chr 31> or <chr 30>.

Not for binary data

This structure is not fit for binary data, as it uses separators which are “binary” codes as well.

In the end, all data sent over the Sockets and packaged in the Messages and “CSV” Text Objects is Binary, but with a limited set of bytes used and with specific binary values used for very specific purposes (like numbers, characters and separating lines and items of data).

A binary object, like an image, can contain any and all of this data, breaking the “CSV” and corrupting its own integrity in the process.

Parsing Tab Separated Values

There are two separators we have to deal with:

  1. <LF> – The separator for each next row. We use <LF> as it is the common character in both Windows (<CR><LF>) and modern Mac line endings.
  2. <Tab> – The separator for each next value

Parsing Secure Comma Separated Values

There are two separators we have to deal with:

  1. <chr 30> – The separator for each next row. As indicated before, we strive for a safe value; the possibility of this value being inserted by a user is close to zero
  2. <chr 31> – The separator for each next value. The same goes as for <chr 30>.

Structure

The first line of the clear text object contains our field / variable names. We need those to map the columns to the variables in our object.

Each next line contains the values for each object in our list, where one line represents one object.
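
For illustration, writing <US> for <chr 31> and <RS> for <chr 30>, the person list from earlier – extended with an ID in the first column – would be transferred as:

id<US>firstname<US>lastname<US>someNumericValue<RS>1<US>Peter<US>Kaptein<US>1234<RS>2<US>John<US>Doe<US>5678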

Speed

Let the code speak for itself. These are all the lines we need to break an SCSV clear text object into usable pieces for our parser:

// Container for the parsed items
List<string[]> SCSVrows = new List<string[]>();
string[] SCSVheader;
string[] SCSVrow;

// Split lines
string[] SCSVcontent = SCSVTextObject.Split(new string[] { LINE_SEPARATOR }, StringSplitOptions.None);

int itemLength = SCSVcontent.Length;

if (itemLength < 2)
{
    // No content. Do nothing
    return null;
}

// Get header
SCSVheader = SCSVcontent[0].Split(new string[] { ITEM_SEPARATOR }, StringSplitOptions.None);

// Parse all items with content
for (int i = 1; i < SCSVcontent.Length; i++)
{
    SCSVrow = SCSVcontent[i].Split(new string[] { ITEM_SEPARATOR }, StringSplitOptions.None);
    SCSVrows.Add(SCSVrow);
}

As the separators we use are (within this context) quite “unbreakable”, we do not perform any checks. Anything or anyone who breaks this structure deserves a reward.

The variables ITEM_SEPARATOR and LINE_SEPARATOR contain either the ASCII codes for SCSV (<chr 31> and <chr 30>) or those for Tab Separated Values (<Tab> and <LF>).

Since we use CSV as our basis, the only things that differ are the separators, and the code can be re-used for any variation on the theme.

The result is two arrays with more stuff inside:

  1. Header – Containing the first row, which contains the mapping of each item in our data structure to a variable on the Object we want to parse it to
  2. Content – Containing the values we want to inject into each object

Mapping the values to an object

When we map data from an unknown source to an object, we need to be aware of two things:
  1. Discrepancies in variables and variable names – We received the data from somewhere else, serialized from an object that might live in a completely different environment. So where we call “first name” “FirstName”, our remote friends might call it “first_name”.
  2. Discrepancies in data types – Where we might be storing 16-bit Integers, our remote friends might be storing 64-bit Integers. So when the number we receive is too big, we store something else than what we actually received. The same goes for floating point numbers and their precision (see the example below).
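
For simple values, .NET’s Convert.ChangeType (used in the parser further below) bridges such type discrepancies; for example:

// A string value from the "CSV" converted into the 16-bit integer our object expects
object raw = "1234";
short parsed = (short)Convert.ChangeType(raw, typeof(short)); // parsed == 1234

Note that it throws an exception when the received number does not fit the target type, rather than silently truncating it.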

Another thing is performance.

With only 5 objects in a list, it does not matter much if we are inefficient. But when it is 5000 objects, we want to avoid wasting processing power.

Steps taken to solve these issues and ensure performance are the following:

  1. Reflecting the structure of an object only once – We only request the structure of an object once. Once we have that structure (containing a list of fields/variables we can parse against) we do not need to request it again in that round.
  2. Preparing the objects first – Instead of creating the objects we need per line we parse, we first request the objects we need, allowing us to do the next step in our optimization:
  3. Parsing per column – Instead of stepping through each Row and then parsing each Column in each Row, we take each column and run through all rows. There are several reasons to do that:
    1. Only parsing Columns we can parse – To avoid waste and blindly using “Try/Catch” constructions to capture failure, we use the Object we want to parse to as the master; the data from our “CSV” is slave. For each variable we find in our Object, we try to find the matching column in our “CSV” dataset. When available, we run through all rows and inject the values into our already prepared and available objects.
    2. Increase in parsing speed – By reducing repetition where it costs us. We match each available variable from the Object only once against each available column in our “CSV” dataset, instead of doing that over and over again for each line/row we parse.
The complete code looks as follows (unfortunately without code highlighting). An explanation follows below:
public static List<ISCSVobject> inject(string SCSVcontentString, Type objectType)
{
    // Get the public fields and properties of the object type - using reflection
    FieldInfo[] fields = objectType.GetFields();
    PropertyInfo[] properties = objectType.GetProperties();

    // Container for the parsed items
    List<string[]> SCSVrows = new List<string[]>();

    string[] SCSVheader;
    string[] SCSVrow;

    // Split lines
    string[] SCSVcontent = SCSVcontentString.Split(new string[] { LINE_SEPARATOR }, StringSplitOptions.None);

    int itemLength = SCSVcontent.Length;

    if (itemLength < 2)
    {
        // No content. Do nothing
        return null;
    }

    // Get header
    SCSVheader = SCSVcontent[0].Split(new string[] { ITEM_SEPARATOR }, StringSplitOptions.None);

    // Parse all items with content
    for (int i = 1; i < SCSVcontent.Length; i++)
    {
        SCSVrow = SCSVcontent[i].Split(new string[] { ITEM_SEPARATOR }, StringSplitOptions.None);
        SCSVrows.Add(SCSVrow);
    }

    // Base object - used to retrieve the persistent objects by their ID
    ISCSVobject myBaseObject = Activator.CreateInstance(objectType) as ISCSVobject;
    ISCSVobject myObject;

    // Object list - for faster parsing
    List<ISCSVobject> myObjectList = new List<ISCSVobject>();

    string objectID;

    // Collect objects
    for (int row = 0; row < SCSVrows.Count; row++)
    {
        SCSVrow = SCSVrows[row];

        // Our objectID is always column 0
        objectID = SCSVrow[0];

        // Get object
        myObject = myBaseObject.getObject(objectID) as ISCSVobject;

        // Add to list for fast reference, using the same index as our rows
        myObjectList.Add(myObject);
    }

    // Prepare for conversion
    string propertyName;

    // Did we find the property?
    bool propertyFoundInSCSV;

    // What is the index of our corresponding CSV column?
    int columnID = 0;

    // What is the value we found?
    object value;

    // Start conversion

    // For each field
    foreach (FieldInfo property in fields)
    {
        propertyFoundInSCSV = false;

        // Get the field name
        propertyName = property.Name;

        // Get the index of the field name in the SCSV columns
        for (int mi = 0; mi < SCSVheader.Length; mi++)
        {
            // Did we find the field in the SCSV header?
            propertyFoundInSCSV = (SCSVheader[mi] == propertyName);
            if (propertyFoundInSCSV)
            {
                // If so: set the index to the proper value
                columnID = mi;
                break;
            }
        }

        // Inject values into objects, per field
        if (propertyFoundInSCSV)
        {
            // Go through the list of found objects
            for (int i = 0; i < myObjectList.Count; i++)
            {
                // Get the object we identified
                myObject = myObjectList[i];

                // Get the SCSV line for that object
                SCSVrow = SCSVrows[i];

                // Get the value from the column that matches our local field
                value = SCSVrow[columnID];

                // Convert it to the proper type
                value = Convert.ChangeType(value, property.FieldType);

                // Set the value
                property.SetValue(myObject, value);
            }
        }
    }

    // For each property
    foreach (PropertyInfo property in properties)
    {
        propertyFoundInSCSV = false;

        // Get the property name
        propertyName = property.Name;

        // Get the index of the property name in the SCSV columns
        for (int mi = 0; mi < SCSVheader.Length; mi++)
        {
            // Did we find the property in the SCSV header?
            propertyFoundInSCSV = (SCSVheader[mi] == propertyName);
            if (propertyFoundInSCSV)
            {
                // If so: set the index to the proper value
                columnID = mi;
                break;
            }
        }

        // Inject values into objects, per property
        if (propertyFoundInSCSV)
        {
            // Go through the list of found objects
            for (int i = 0; i < myObjectList.Count; i++)
            {
                // Get the object we identified
                myObject = myObjectList[i];

                // Get the SCSV line for that object
                SCSVrow = SCSVrows[i];

                // Get the value from the column that matches our local property
                value = SCSVrow[columnID];

                // Convert it to the proper type using a .NET trick
                value = Convert.ChangeType(value, property.PropertyType);

                // Set the value (properties need the extra index parameter)
                property.SetValue(myObject, value, null);
            }
        }
    }

    return myObjectList;
}
The code is quite straightforward, took about 6 hours to produce (including the serializers) and in its most basic form does the following:
  1. Get the variables from the Object we can map against – Using reflection
  2. Breaking the CSV content into pieces – We break the SCSV text object into the rows we have
  3. Extracting the header: which contains the mapping values against variables – We take the first row, break it into smaller chunks and put it in a “header” Array. This contains the names of the variables we can map the values against
  4. Extracting the data/values – Containing the values we want to inject into objects. By default the first column/item is the ID of the object
  5. Preparing the objects for injection of values – So that we can inject the values column by column, for a more optimal way of parsing (with fewer cycles to run through and less work for the processor)
  6. Running through each found variable from the Object – And when we find a match in the header of our “CSV” object:
    1. Run through each row – And retrieve the value corresponding to the variable in the Object we want to inject it to
    2. Inject the value – Into the corresponding variable in the corresponding object
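
Using the parser then comes down to a single call; a hypothetical example, assuming MyProductClass implements ISCSVobject:

// Parse an SCSV clear text object into a list of (persistent) product objects
List<ISCSVobject> products = inject(scsvTextObject, typeof(MyProductClass));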

Conclusion

  1. Simple base structure to serialize data in – By using the principles of CSV, but choosing different and more consistent variations on CSV, we have a structure to serialize, send and deserialize Objects fast and reliably.
  2. Compact “fatless” dataset – More importantly, we get a very compact dataset which only contains the highly necessary data.
  3. No ambiguity in Secure CSV – With CSV you can get ambiguous results. By using only one way to structure the data in Secure CSV, with characters that are not available from the keyboard, we do not have to vary our setup.
  4. More efficient parsing – This results in a more efficient way of sending and parsing data than XML or JSON can ever reach.
  5. Humans can still produce Excel sheet based data – By choosing Tab Separated Values next to a proprietary set of separators, we can still use data produced by humans via a generic tool like Microsoft Excel.
  6. If human: Tab Separated – The only requirement for the user is to save the content as a TXT file with <Tab> as the separator.