Everything About Programming: Google Protocol Buffers (ProtoBuf)

If you have done any serialization work in Java, you know that it's not that easy. The default serialization mechanism is inefficient and has a host of problems (see Effective Java, Items 74 to 78), so it's really not a good choice for persisting Java objects in production. Many of the efficiency shortcomings of default serialization can be mitigated by using a custom serialized form, but custom forms come with their own encoding and parsing overhead. Google Protocol Buffers, popularly known as protobuf, is an alternative and faster way to serialize Java objects. It's a good replacement for Java serialization and is useful both for data storage and for data transfer over a network. It's open source, well tested and, most importantly, widely used inside Google itself, which everyone knows puts a lot of emphasis on performance. It's also feature rich and defines serialization formats for all of its data types, which means you don't need to reinvent the wheel. It's very productive too: as a developer you just define your message formats in a .proto file, and Protocol Buffers takes care of the rest of the work. Google provides a protocol buffer compiler that generates Java source code from the .proto file, along with an API to read and write messages on the protobuf object. You don't need to bother about any encoding or decoding details; all you have to specify is your data structure, in a Java-like format. There are more reasons to use Google Protocol Buffers, which we will see in the next section.

Google Protocol Buffer vs Java Serialization vs XML vs JSON

You can't ignore protobuf if you care about performance. I agree that there are lots of ways to serialize data, including JSON, XML and your own ad-hoc format, but they all have some kind of serious limitation when it comes to storing non-trivial objects. XML and JSON are both feature rich and language independent, and both have lots of open source Java libraries to take care of encoding and decoding. For example, you can use Jackson to parse JSON in Java, or XML parsers like SAX or DOM to work with data in XML format. These formats are good if you are sharing data with other applications, as XML is one of the most widely used data interchange formats and JSON is a close second, but they have their own problems. XML is verbose: it takes a lot of space to represent a small amount of data, and XML parsing can impose a huge performance penalty on applications. Also, traversing an XML DOM is not as easy as setting fields in a Java class, which is all you need to do with Google Protocol Buffers. JSON is less verbose and takes less space than XML, but you still incur a performance penalty on encoding and decoding. Another benefit of Google Protocol Buffers over JSON is that protobuf has a strict message format defined using .proto files. Let's see an example protobuf message that represents an Order in a .proto file.

message Order {
  required int64 order_id = 1;
  required string symbol  = 2;
  required double quantity = 3;
  required double price = 4;
  optional string text = 5;
}

By using a strict, grammar-defined schema, we get several benefits over something like JSON. Just by looking at the .proto file, we know the field names, which fields are required and which are optional and, more importantly, the data type of each field. Google Protocol Buffers also lets you compile a .proto file into multiple target languages, e.g. Java, C++ or Python.

One of the rawest but still workable approaches for performance-sensitive applications is to invent your own ad-hoc way to encode data structures. This is rather simple and flexible, but not good from a maintenance point of view, as you need to write your own encoding and decoding code, which is a sort of reinventing the wheel. To make it as feature rich as Google protobuf, you would need to spend a considerable amount of time. So this approach works best only for the simplest data structures and is not productive for complex objects. If performance is not your concern, then you can still use the default serialization protocol built into Java itself but, as mentioned in Effective Java, it has a lot of problems. It's also not a good fit if you are sharing data between two applications that are not both written in Java, e.g. with a native application written in C++.
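To make the trade-off concrete, here is a minimal sketch of an ad-hoc encoding next to default Java serialization. The class `AdHocOrderDemo`, its nested `Order` class and the sample values are hypothetical names made up for this illustration; the fields mirror the required fields of the Order message above, and the optional text field is left out to keep it short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of any library.
class AdHocOrderDemo {

    static class Order implements Serializable {
        private static final long serialVersionUID = 1L;
        final long orderId;
        final String symbol;
        final double quantity;
        final double price;

        Order(long orderId, String symbol, double quantity, double price) {
            this.orderId = orderId;
            this.symbol = symbol;
            this.quantity = quantity;
            this.price = price;
        }
    }

    // Ad-hoc format: fields in a fixed order, with no field names and no type tags.
    static byte[] encode(Order order) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(order.orderId);
            out.writeUTF(order.symbol);
            out.writeDouble(order.quantity);
            out.writeDouble(order.price);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // The decoder must read the fields in exactly the order the encoder wrote them.
    static Order decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            long orderId = in.readLong();
            String symbol = in.readUTF();
            double quantity = in.readDouble();
            double price = in.readDouble();
            return new Order(orderId, symbol, quantity, price);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Default Java serialization of the same object, for a size comparison.
    static byte[] javaSerialize(Order order) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(order);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Order order = new Order(42L, "GOOG", 100.0, 532.5);
        byte[] adHoc = encode(order);
        Order copy = decode(adHoc);
        System.out.println("ad-hoc bytes: " + adHoc.length);
        System.out.println("default serialization bytes: " + javaSerialize(order).length);
        System.out.println("round trip symbol: " + copy.symbol);
    }
}
```

The ad-hoc form is compact, but the field order is hard-coded on both sides, so adding, removing or reordering a field means changing the encoder and every decoder together. That is the maintenance cost described above.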

Google Protocol Buffers provides a middle path: it is not as space intensive as XML and is much better than Java serialization, being both more flexible and more efficient. With Google Protocol Buffers, all you need to do is write a .proto description of the object you wish to store. From that, the protocol buffer compiler creates a Java class that implements automatic encoding and parsing of the data in an efficient binary format. This generated class, known as the protobuf object, provides getters for the fields that make up a protocol buffer (with the matching setters on its builder) and takes care of the details of reading and writing the protocol buffer as a unit. Another big plus is that the protocol buffer format supports extending the format over time in such a way that code can still read data encoded with the old format, though you need to follow certain rules to maintain backward and forward compatibility.
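The reason a format can be extended over time is easier to see with a small hand-rolled sketch. The code below is not the protobuf library and not its generated code; it only imitates the idea of prefixing every field with a numeric tag and a wire type, so that a reader can skip fields it does not know. The class `TaggedFieldDemo`, the extra field number 6 and the value "desk-7" are assumptions made up for this illustration, and only two wire types are handled.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hand-rolled illustration of tag-prefixed fields; NOT the protobuf API.
class TaggedFieldDemo {
    static final int VARINT = 0;            // wire type used for integers
    static final int LENGTH_DELIMITED = 2;  // wire type used for strings and bytes

    // Writes 7 bits per byte, lowest bits first; the high bit means "more bytes follow".
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    static void writeTag(ByteArrayOutputStream out, int fieldNumber, int wireType) {
        writeVarint(out, ((long) fieldNumber << 3) | wireType);
    }

    static void writeString(ByteArrayOutputStream out, int fieldNumber, String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        writeTag(out, fieldNumber, LENGTH_DELIMITED);
        writeVarint(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // A "new" writer: fields 1 and 2 plus a hypothetical field 6 added later.
    static byte[] writeNewOrder() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeTag(out, 1, VARINT);
        writeVarint(out, 42L);           // order_id
        writeString(out, 2, "GOOG");     // symbol
        writeString(out, 6, "desk-7");   // hypothetical field unknown to old readers
        return out.toByteArray();
    }

    // Minimal cursor over the encoded bytes.
    static final class Cursor {
        final byte[] buf;
        int pos;

        Cursor(byte[] buf) { this.buf = buf; }

        long readVarint() {
            long result = 0;
            int shift = 0;
            while (true) {
                byte b = buf[pos++];
                result |= (long) (b & 0x7F) << shift;
                if ((b & 0x80) == 0) return result;
                shift += 7;
            }
        }
    }

    // An "old" reader that only knows field 1 (order_id) and field 2 (symbol).
    static String readOldOrder(byte[] data) {
        Cursor in = new Cursor(data);
        long orderId = 0;
        String symbol = "";
        while (in.pos < data.length) {
            long tag = in.readVarint();
            int fieldNumber = (int) (tag >>> 3);
            int wireType = (int) (tag & 7);
            if (wireType == VARINT) {
                long value = in.readVarint();
                if (fieldNumber == 1) orderId = value;   // other numbers are skipped
            } else if (wireType == LENGTH_DELIMITED) {
                int length = (int) in.readVarint();
                if (fieldNumber == 2) {
                    symbol = new String(data, in.pos, length, StandardCharsets.UTF_8);
                }
                in.pos += length;                        // known or not, move past the payload
            } else {
                throw new IllegalStateException("unsupported wire type " + wireType);
            }
        }
        return orderId + ":" + symbol;
    }

    public static void main(String[] args) {
        System.out.println(readOldOrder(writeNewOrder())); // prints 42:GOOG
    }
}
```

Because each field carries its own number and wire type, the old reader can step over field 6 without knowing what it means. This is also why the compatibility rules matter: if a field number is reused for a different meaning, old readers will misinterpret the data.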

Google Protobuf Tutorial for Java developers

You can see that protobuf has some serious things to offer, and it has certainly found its place in financial data processing and FinTech. XML and JSON still have their wide uses, though, and I still recommend them depending upon your scenario; JSON, for example, is more suitable for web development, where one end is Java and the other is a browser running JavaScript. Protocol Buffers also has more limited language support than XML or JSON. Officially, Google provides compilers for C++, Java and Python, and there are third-party add-ons for Ruby and other programming languages, whereas JSON has almost ubiquitous language support.

In short, XML is good for interacting with legacy systems and for web services, but for high-performance applications that currently use their own ad-hoc way of persisting data, Google Protocol Buffers is a good choice. Protocol Buffers also has miscellaneous utilities that can be useful to you as a protobuf developer. There are plugins for Eclipse, NetBeans IDE and IntelliJ IDEA that provide syntax highlighting, content assist and automatic generation of numeric tags as you type. There is also a Wireshark/Ethereal packet sniffer plugin to monitor protobuf traffic.

That's all for this introduction to Google Protocol Buffers. In the next article we will see how to use Google Protocol Buffers to encode Java objects.

If you like this article and are interested in knowing more about serialization in Java, I recommend you check some of my earlier posts on the same topic:

Top 10 Java Serialization Interview Questions and Answers (list)
Difference between Serializable and Externalizable in Java? (answer)
Why use SerialVersionUID in Java? (answer)
How to work with transient variable in Java? (answer)
What is difference between transient and volatile variable in Java? (answer)
How to serialize object in Java? (answer)

Monday, 20 July 2015
