There are many ways to convert well-structured JSON into tabular formats like CSV. There is a free tool for the job in almost every language (Ruby, Python, Go, etc.). Yet each time I get my hands on real data, the actual conversion proves elusive. The reason is that most JSON converters offer far too many options and can quickly become more complicated than a hand-written procedural conversion.
Most JSON files I need to convert are arrays of elements (i.e. documents) with similar or identical formats. The elements are not very complex, although they may sometimes contain nested structures. All I expect from a converter is to (1) detect the schema automatically and (2) flatten the structure, updating the schema accordingly. Flattening involves both extracting embedded hashes and converting embedded arrays into strings. What I am particularly wary of is having to create outline files for JSON extraction, or having to inspect the schema manually in order to name the required fields.
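The flattening I have in mind can be sketched in a few lines (shown here in Ruby; the dotted key paths and the semicolon separator are my own choices, not a standard):

```ruby
require 'json'

# Flatten a nested hash: embedded hashes get dotted key paths,
# embedded arrays are joined into plain strings.
def flatten_record(record, prefix = "")
  record.each_with_object({}) do |(key, value), flat|
    path = prefix.empty? ? key.to_s : "#{prefix}.#{key}"
    case value
    when Hash  then flat.merge!(flatten_record(value, path))
    when Array then flat[path] = value.join(";")
    else            flat[path] = value
    end
  end
end

doc  = JSON.parse('{"id": 1, "name": {"first": "Ada"}, "tags": ["a", "b"]}')
flat = flatten_record(doc)
# flat is {"id"=>1, "name.first"=>"Ada", "tags"=>"a;b"}
```

Run over every element of the array, the union of the flattened keys is the detected schema, and each flattened hash is one CSV row.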
When I am working in R, I use the excellent jsonlite package by Jeroen Ooms for such purposes. jsonlite does a good part of the conversion job automatically, but it keeps the embedded lists and vectors (R lingo for array fields) intact. That is why jsonlite's output can't be saved as CSV. jsonlite is quite fast, and I trust its JSON parser more than other R/JSON packages. I figured that if I made an executable script out of jsonlite, I could use it as my everyday conversion tool…
So, I wrote a tiny script in R that completes the flattening of JSON files by automatically converting array fields into strings. It can read the JSON data from stdin or from a file named in the command-line arguments. It does not ask for any guidance on what to suppress and what to retain: it flattens everything and keeps all the fields in the schema. You will, of course, need R and jsonlite to run it. You may put the script in your bin directory and set its execution bit (chmod +x json2csv.R) so you can run it like a command. It makes a lightweight and useful everyday tool.

» Posted on 20 Aug 2015 by Mahmood S. Zargar
Lately, I have been relying on OrientDB to store my raw data before transferring it to analysis software such as R. OrientDB is the first multi-model database I have ever used, and it has turned out to be a great choice as a pre-analysis data storage system.
Multi-model databases allow the user to save data using one model (let's say document) and retrieve it using another (e.g. graph). In OrientDB one can query the data as simple tables, as a document collection, or even as a graph. OrientDB supports schemas, but it is pretty relaxed about them and allows schema-less data just like a document database. It also allows traversing from one document to another through explicit links, just as a graph database does. But at heart OrientDB is an object-oriented database: it sees everything (any record, edge, vertex or document) as a nested or nestable object. Finally, OrientDB supports user-defined functions and complex types such as DateTime. All this makes it an extraordinary data-gathering tool for the researcher.
What is really lacking in the OrientLand is good interfaces (and I am excluding its native web-based user interface). Although OrientDB speaks SQL and should supposedly be easy to access through traditional database interfaces, as of now it only has a half-finished JDBC connector that regularly crashes DBeaver. It doesn't have an equivalent of MongoDB's MongoChef for its document interface either. Importing and exporting standard formats such as CSV and GraphML is possible, but depending on the size of your data it can range from a hassle to outright impractical. Last but not least, OrientDB still lacks drivers for many languages.
It is this last point that pushed me to write my own R driver for OrientDB, dubbed OrientR. I regularly need to transfer huge chunks of data from OrientDB to R, and R does not really like the JSON output of OrientDB's REST API. So I decided to develop a lightweight REST API wrapper that can pre-chew OrientDB's output and make it ready for R. The driver can run queries against OrientDB's REST API, clean up the fetched resultset, take care of flattening the JSON into a dataframe, and also do some automatic or customized type conversion. You can check OrientR out here. The readme file shows basic usage and installation, and inside R you can use ? to look up the more sophisticated options. The objective is to have the driver translate directly between igraph and OrientDB some day, but this is just a start. Hope it helps some of you out there get into the multi-model world.

» Posted on 05 Aug 2015 by Mahmood S. Zargar
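To give an idea of the cleanup step, here is a minimal sketch (in Ruby rather than R, with an invented payload) of what pre-chewing an OrientDB resultset involves: the REST query endpoint wraps records in a result array, and each record carries @-prefixed metadata fields (@rid, @class, and so on) that must be stripped before the documents can become dataframe rows:

```ruby
require 'json'

# A made-up example of the JSON shape OrientDB's REST API returns:
# the records live under "result" and carry @-prefixed metadata fields.
raw = '{"result": [
  {"@type": "d", "@rid": "#9:0", "@class": "Person", "name": "Ada",   "langs": ["R", "Ruby"]},
  {"@type": "d", "@rid": "#9:1", "@class": "Person", "name": "Grace", "langs": ["COBOL"]}
]}'

rows = JSON.parse(raw)["result"].map do |doc|
  doc.reject { |key, _| key.start_with?("@") }  # drop the metadata fields
end
# rows is [{"name"=>"Ada", "langs"=>["R", "Ruby"]},
#          {"name"=>"Grace", "langs"=>["COBOL"]}]
```

After this pass, the array fields still need the flattening treatment before the rows fit into a rectangular dataframe.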
According to OrientDB's website, migrating your data from Neo4j is pretty straightforward, involving only three simple steps:
- Installing Neo4j shell tools
- Exporting your data from Neo4j in GraphML format: export-graphml -t -o /tmp/out.graphml
- Importing your data into OrientDB from GraphML: create database plocal:/tmp/db/test and then import database /tmp/out.graphml
In practice, several limitations may hinder this supposedly simple process. My recent attempt to transfer some of my data from Neo4j to OrientDB revealed at least four:
- OrientDB won’t import nodes that have more than one label associated with them in Neo4j. The solution is to transform all your nodes into single-label nodes before exporting from Neo4j.
- OrientDB won't import nodes with properties called label. The solution is to rename such properties, if any.
- OrientDB's console loads the whole GraphML file into the Java heap before importing it into the database, so it needs a maximum heap size at least as large as your GraphML file. The solution is to set the maximum heap size for the console in $orientdb/bin/console.sh. In my case this meant adding JAVA_OPTS="-Xmx8192m" to line 43 of the script.
- OrientDB's console is not good with parallel processing. Although database operations are mostly IO-bound, this turns out to be a limiting factor when importing graph data. The solution is to connect to OrientDB remotely rather than natively when using the console. In concrete terms, instead of the suggested create database plocal:/tmp/db/test you may want to run create database remote:localhost/test USERNAME PASSWORD plocal.
Once you are done, you can sit back and wait for hours while OrientDB imports the data. The average speed seems to be around 1000 records per second for the edges (8 cores - SSD). The vertices are imported at a much higher rate.
» Posted on 11 Apr 2015 by Mahmood S. Zargar
Sometimes data comes in objects rather than in well-known hierarchical (e.g. JSON) or tabular (e.g. CSV) formats. It is not always easy to extract the data embedded in such objects, especially when they contain other nested objects or data structures and do not offer a string representation of their content. While it is possible to implement a serialization method for any object, one may not want to go that far if only the properties need to be extracted.
Here is a simple solution: Dump the object as YAML, remove the object headers (tags), and then reload the YAML as an object. The reloaded YAML will be a hash containing all the values saved in the properties of the original object. The hash, then, can be converted to JSON or saved in a document database for data analysis.
The Ruby code follows. Visit the Gist page to see a conversion sample.
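The core of the trick fits in a few lines (the classes and property names below are illustrative, not from the Gist):

```ruby
require 'yaml'

# An example object with a nested object inside and no
# serialization support of its own.
class Engine
  def initialize
    @cylinders = 4
  end
end

class Car
  def initialize
    @make   = "Saab"
    @engine = Engine.new
  end
end

# 1. Dump the object as YAML.
yaml = Car.new.to_yaml
# 2. Remove the object headers (the !ruby/object:... tags).
yaml = yaml.gsub(/!ruby\/object:\S+/, "")
# 3. Reload: what comes back is a plain hash of the properties,
#    with nested objects turned into nested hashes.
hash = YAML.load(yaml)
# hash is {"make"=>"Saab", "engine"=>{"cylinders"=>4}}
```

A side benefit of stripping the tags is that the reload stays within YAML's safe subset, since no Ruby objects need to be instantiated.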
The same method should be applicable to Python objects, except that instead of Ruby's !ruby/object: headers, Python's !python/object/apply: headers must be removed.

» Posted on 23 Mar 2015 by Mahmood S. Zargar
We all know that Neo4j's Cypher has recently had a language upgrade to accommodate ETL. Yes, the graph engine now supports LOAD CSV within Cypher scripts.
CSV is a great data interchange format, especially because one can edit, manipulate and filter it with simple command-line tools. The caveat is that it requires a full export of all the data into a text-only format with no widely accepted standard for reading and writing it. You may run into issues as simple as misdetected text encoding or as complicated as converting string-array fields into plain text. Furthermore, LOAD CSV covers only the import side.
What if we need to migrate some data back to the good old RDBMS? Isn’t there a better, one-shot way for migrating some of our precious graph data into another database?
Thanks to Neo4j's JDBC driver, there is. But most end-user ETL or data-migration applications don't really like a driver that speaks Cypher rather than SQL, and they run into all sorts of errors while trying to automate data access through Neo4j's JDBC driver.
One tool that allows you to benefit from all the facilities offered by the standardized Java database interfaces, and yet perfectly obeys your orders, is Scriptella.
Below is a Scriptella configuration that shows how to use the MySQL and Neo4j JDBC drivers with Scriptella to run an ETL process. All you need to do is put the etl.xml file in Scriptella's bin directory and the JDBC driver files in its lib directory, then run Scriptella.

» Posted on 27 Oct 2014 by Mahmood S. Zargar
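A minimal sketch of what such an etl.xml can look like follows; the connection URLs, credentials, Cypher query and target table are all placeholders to adapt to your own setup:

```xml
<!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
<etl>
  <description>Copy Person nodes from Neo4j into a MySQL table.</description>
  <connection id="neo4j" driver="org.neo4j.jdbc.Driver"
              url="jdbc:neo4j://localhost:7474" user="neo4j" password="secret"/>
  <connection id="mysql" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb" user="root" password="secret"/>
  <query connection-id="neo4j">
    MATCH (p:Person) RETURN p.name AS name, p.age AS age
    <script connection-id="mysql">
      INSERT INTO person (name, age) VALUES (?name, ?age);
    </script>
  </query>
</etl>
```

Each row returned by the Cypher query becomes a set of ?-variables available to the nested SQL script, so the export and the import happen in one shot, with no CSV in between.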