Multi-layer architecture, scalability, multitenancy, and durability are just some of the reasons companies have been using Pulsar.
By Ben Lorica and Jesse Anderson.
With companies producing data from an increasing number of systems and devices, messaging and event streaming solutions, particularly Apache Kafka, have gained widespread adoption. Over the past year, we've been tracking the progress of Apache Pulsar (Pulsar), a less well-known but highly capable open source solution originated by Yahoo. Pulsar is designed to intelligently process, analyze, and deliver data from an expanding array of services and applications, and thus it fits nicely into modern data platforms. Pulsar is also designed to ease the operational burdens normally associated with complex, distributed systems.
Who else is interested in Pulsar? Karthik Ramasamy, CEO of Streamlio, was kind enough to share geo-demographic data of recent visitors to the project's homepage:
Of the thousands of recent visitors to the site, 33% were from the Americas, 36% from Asia-Pacific, and 27% from the EMEA region.
While Apache Kafka is by far the most popular pub/sub solution, over the last year we've started to come across numerous companies that use Pulsar. It turns out that Pulsar has a few features these companies value, including:
- Multi-layer architecture composed of a serving layer (brokers that coordinate how messages are received, stored, processed, and delivered), a storage layer (Apache BookKeeper nodes that persist messages), and a processing layer (via Pulsar Functions or Pulsar SQL).
- High performance and scalability: Pulsar has been used at Yahoo for several years to handle 100 billion messages per day on over two million topics. It is able to support millions of topics while delivering high-throughput and low-latency performance.
- Easily add storage or serving capacity without having to rebalance the entire cluster: the multi-layer architecture allows storage to be added independently of serving. Both the serving and storage layers can also be expanded without any downtime.
- Support for popular messaging models including pub/sub messaging and message queuing.
- Multitenancy allows a single Pulsar cluster to support an entire enterprise and lets each team have a separate namespace with its own quotas.
- Durability (no data loss): data is replicated and synced to disk.
- Geo-replication: out-of-the-box support for geographically distributed applications. Pulsar supports several different modes for replicating data between clusters.
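To make the "pub/sub messaging and message queuing" item a bit more concrete, here is a small in-memory sketch in Python. It is a toy model, not the Pulsar client API: `ToyTopic`, its methods, the subscription names, and the topic name `persistent://acme/analytics/clicks` are all hypothetical and exist only for illustration. The idea it tries to show is that every subscription on a topic receives its own copy of each message (pub/sub), while consumers that attach to the same subscription split the messages among themselves (queuing). Round-robin is assumed here as the simplest way to share work.

```python
class ToyTopic:
    """In-memory stand-in for a topic. Illustrative only, not the Pulsar client."""

    def __init__(self, name):
        self.name = name
        self.subscriptions = {}  # subscription name -> list of consumer inboxes
        self._next_index = {}    # subscription name -> round-robin counter

    def subscribe(self, subscription):
        """Attach a new consumer to a subscription and return its inbox (a list)."""
        inbox = []
        self.subscriptions.setdefault(subscription, []).append(inbox)
        return inbox

    def publish(self, message):
        """Deliver one copy of the message to each subscription.

        Within a subscription, exactly one consumer receives the message,
        chosen round-robin (an assumption made for this sketch).
        """
        for subscription, inboxes in self.subscriptions.items():
            index = self._next_index.get(subscription, 0)
            inboxes[index % len(inboxes)].append(message)
            self._next_index[subscription] = index + 1


# Hypothetical topic name in a tenant/namespace/topic layout.
topic = ToyTopic("persistent://acme/analytics/clicks")

audit = topic.subscribe("audit")        # its own subscription: sees everything
worker_a = topic.subscribe("workers")   # two consumers share one subscription
worker_b = topic.subscribe("workers")

for i in range(4):
    topic.publish(f"event-{i}")

print(audit)     # all four events
print(worker_a)  # half of the events
print(worker_b)  # the other half
```

In this sketch the `audit` consumer behaves like a classic pub/sub subscriber, while `worker_a` and `worker_b` behave like competing consumers on a work queue. A real deployment would use the Pulsar client library and a running cluster, and would also deal with acknowledgements, persistence, and failures, none of which are modeled here.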
[A version of this post appears on the O'Reilly Radar.]
Related:
- Jesse Anderson: "Reducing Operational Overhead with Pulsar Functions"
- "One simple chart: Who is interested in Spark NLP?"
- "One simple graphic: Researchers love PyTorch and TensorFlow"
- Tyler Akidau: Streaming 101 and Streaming 102
- "Apache Kafka and the four challenges of production machine learning systems"
- Jay Kreps: "Building Apache Kafka from scratch"
- Karthik Ramasamy: "Architecting and building end-to-end streaming applications"
- "What machine learning means for software development"
