Meta has introduced Tulip, a binary serialization protocol that supports schema evolution. It simultaneously addresses protocol reliability and data schematization, and it replaces multiple legacy formats used across Meta’s data platform, yielding a considerable increase in performance and efficiency.

Meta’s data platform is made up of numerous heterogeneous services, such as warehouse data storage and various real-time systems, that exchange large amounts of data and communicate among themselves via service APIs. As the number of AI and machine learning (ML) workloads that consume this data for model training grows, Meta must continually make its data logging systems more efficient. The schematization of data plays a central role in building a data platform at Meta’s scale. These systems are designed with the knowledge that every decision and trade-off impacts reliability, data preprocessing efficiency, performance, and the engineers’ developer experience. Changing the serialization format of the data infrastructure is a big bet, but one that offers long-term benefits as the platform evolves.
The Data Analytics Logging Library runs in the web tier and in internal services, and it is responsible for logging analytical and operational data via Scribe, a durable message queuing system used by Meta. Consumers, including the data platform ingestion service and real-time processing systems, read and ingest data from Scribe. The data analytics reading library then deserializes the data and rehydrates it into a structured payload. Thousands of engineers at Meta create, update, and delete logging schemas every month, and data conforming to these schemas flows over Scribe at a rate of petabytes per day.
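The write-and-read path described above can be sketched as follows. This is a minimal, illustrative model only: the class and function names (`PageViewEvent`, `InMemoryQueue`, `log_event`, `read_events`) are hypothetical stand-ins, not Meta's actual APIs, and JSON stands in for whichever on-wire format the logging library uses.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class PageViewEvent:
    """Stand-in for one 'logging schema' defined by an engineer."""
    user_id: int
    page: str

class InMemoryQueue:
    """Toy stand-in for Scribe: an append-only, category-keyed message queue."""
    def __init__(self):
        self.messages = []
    def append(self, category: str, payload: bytes):
        self.messages.append((category, payload))

def log_event(queue: InMemoryQueue, event: PageViewEvent):
    # Logging library: serialize the structured payload and append it to the queue.
    queue.append("page_view", json.dumps(asdict(event)).encode("utf-8"))

def read_events(queue: InMemoryQueue):
    # Reading library: deserialize and rehydrate bytes back into structured payloads.
    return [PageViewEvent(**json.loads(payload)) for _, payload in queue.messages]

q = InMemoryQueue()
log_event(q, PageViewEvent(user_id=7, page="/home"))
events = read_events(q)
```

The key point the article makes is that only the serialization step in `log_event`/`read_events` changes when moving to a format like Tulip; the producers and consumers around the queue stay the same.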
Schematization is necessary to ensure that any message logged in the past, present, or future can, regardless of the (de)serializer’s version, be reliably (de)serialized at any time with full fidelity and no data loss. This property is called safe schema evolution via backward and forward compatibility. The article focuses on the on-wire serialization format used to encode the data that is ultimately processed by the data platform. Compared with the two serialization formats used previously, Hive Text Delimited and JSON serialization, the new encoding format is markedly more efficient, requiring 40 to 85 percent fewer bytes and 50 to 90 percent fewer CPU cycles to (de)serialize data.
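One common way binary formats achieve backward and forward compatibility is to tag every field with a stable numeric ID so readers can skip IDs they do not recognize and apply defaults for IDs that are absent. The sketch below shows that general technique; it is not Tulip's actual wire format, and the one-byte-ID/two-byte-length framing is an assumption chosen for brevity.

```python
import struct

def encode_fields(fields: dict[int, bytes]) -> bytes:
    """Encode {field_id: payload} as (1-byte id, 2-byte length, payload) records."""
    out = bytearray()
    for fid, payload in fields.items():
        out += struct.pack(">BH", fid, len(payload))
        out += payload
    return bytes(out)

def decode_fields(buf: bytes, known_ids: set[int]) -> dict[int, bytes]:
    """Decode records, silently skipping any field ID this reader does not know."""
    result, i = {}, 0
    while i < len(buf):
        fid, length = struct.unpack_from(">BH", buf, i)
        i += 3
        if fid in known_ids:
            result[fid] = buf[i:i + length]
        i += length  # unknown field: skipped, not an error (forward compatibility)
    return result

# A v2 writer logs an extra field (id 3) that a v1 reader has never seen.
msg_v2 = encode_fields({1: b"user_click", 2: b"web", 3: b"extra"})
old_reader = decode_fields(msg_v2, known_ids={1, 2})

# A v2 reader handles an old v1 message: field 2 is simply absent,
# so the reader can fall back to a default (backward compatibility).
new_reader = decode_fields(encode_fields({1: b"user_click"}), known_ids={1, 2, 3})
```

Because field IDs, not positions, identify values, schemas can add or retire fields over time without breaking older or newer (de)serializers, which is the property the article describes.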