Apache Kafka
A high-throughput distributed messaging system
Johan Lundahl
Agenda
• Kafka overview
– Main concepts and comparisons to other messaging systems
• Features, strengths and tradeoffs
• Message format and broker concepts
– Partitioning, Keyed messages, Replication
• Producer / Consumer APIs
• Operation considerations
• Kafka ecosystem
If time permits:
• Kafka as a real-time processing backbone
• Brief intro to Storm
• Kafka-Storm wordcount demo
What is Apache Kafka?
• Distributed, high-throughput, pub-sub messaging system
– Fast, Scalable, Durable
• Main use cases:
– log aggregation, real-time processing, monitoring, queueing
• Originally developed by LinkedIn
• Implemented in Scala/Java
• Top level Apache project since 2012: http://kafka.apache.org/
Comparison to other messaging systems
– Traditional: JMS, xxxMQ/AMQP
– New gen: Kestrel, Scribe, Flume, Kafka
[Figure: systems placed on a spectrum from message queues (low throughput, low latency) to log aggregators (high throughput, high latency). Systems shown: RabbitMQ, JMS, ActiveMQ, Qpid, Kestrel, Hedwig, Kafka, Flume, Scribe, batch jobs.]
Kafka concepts
[Figure: producers (frontends, services) push messages for Topic1, Topic2 and Topic3 to the Kafka broker; consumers (monitoring, stream processing, batch processing, data warehouse) pull the topics they are interested in.]
Distributed model
[Figure: several producers and several brokers coordinated through Zookeeper. Labels: producer persistence, partitioned data publication, intra cluster replication, ordered subscription. Consumers are organized as a Topic1 consumer group and a Topic2 consumer group.]
Performance factors
• Broker doesn’t track consumer state
• Everything is distributed
• Zero-copy (sendfile) reads/writes
• Usage of page cache backed by sequential disk allocation
• Like a distributed commit log
• Low overhead protocol
• Message batching (Producer & Consumer)
• Compression (End to end)
• Configurable ack levels
From: http://queue.acm.org/detail.cfm?id=1563874
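To illustrate why message batching and compression work well together, here is a minimal sketch. It is not Kafka code: it only uses Python's standard gzip module to compare compressing messages one by one with compressing them as a single batch. The log lines and the newline framing are made up for the example.

```python
import gzip

def compress_individually(messages):
    """Compress every message on its own (no batching)."""
    return [gzip.compress(m) for m in messages]

def compress_batch(messages):
    """Frame the batch with newlines (an assumption of this sketch) and compress once."""
    return gzip.compress(b"\n".join(messages))

def decompress_batch(blob):
    """Recover the original messages on the consumer side."""
    return gzip.decompress(blob).split(b"\n")

# Hypothetical, similar-looking log lines.
messages = [
    f"2013-10-01 12:00:{i:02d} INFO frontend request served path=/home status=200".encode()
    for i in range(50)
]

individual_bytes = sum(len(c) for c in compress_individually(messages))
batch_blob = compress_batch(messages)
print("one by one:", individual_bytes, "bytes")
print("batched:   ", len(batch_blob), "bytes")
```

Similar messages share most of their content, so a compressor sees far more redundancy in a batch than in a single message, and the batch stays compressed until the consumer unpacks it.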
Kafka features and strengths
• Simple model, focused on high throughput and durability
• O(1) time persistence on disk
• Horizontally scalable by design (broker and consumers)
• Push - pull => consumer burst tolerance
• Replay messages
• Multiple independent subscribers per topic
• Configurable batching, compression, serialization
• Online upgrades
Tradeoffs
• Not optimized for millisecond latencies
• Have not beaten CAP
• Simple messaging system, no processing
• Zookeeper becomes a bottleneck when using too many topics/partitions (>>10000)
• Not designed for very large payloads (full HD movie etc.)
• Helps to know your data in advance
Message/Log Format
[Figure: layout of a single message]
Log based queue (Simplified model)
Producer API used directly by application or through one of the contributed implementations, e.g. log4j/logback appender
• Batching
• Compression
• Serialization
[Figure: Producer1 and Producer2 append messages to the Topic1 and Topic2 logs on the broker (Message1, Message2, ... up to Message10). Consumer1 and Consumer2 read on their own; the three Consumer3 instances form ConsumerGroup1.]
Partitioning
[Figure: producers write to the partitions of Topic1 and Topic2 on the broker. Consumers are organized in Group1, Group2 and Group3, and each partition is read by one consumer per group. A group with more consumers than partitions leaves a consumer idle: "No partition for this guy".]
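The "no partition for this guy" case can be sketched in a few lines. This is a simplified range-style assignment written for illustration, not Kafka's actual rebalancing code; the function name and the consumer ids are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer of a group.

    Partitions are split into contiguous ranges, as evenly as possible.
    Consumers beyond the number of partitions end up with an empty list.
    """
    consumers = sorted(consumers)
    assignment = {c: [] for c in consumers}
    if not consumers:
        return assignment
    per_consumer, extra = divmod(len(partitions), len(consumers))
    start = 0
    for i, consumer in enumerate(consumers):
        # The first `extra` consumers take one additional partition.
        count = per_consumer + (1 if i < extra else 0)
        assignment[consumer] = list(partitions[start:start + count])
        start += count
    return assignment

# Three partitions, four consumers in the same group: one consumer stays idle.
group = assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"])
for consumer, owned in group.items():
    print(consumer, owned if owned else "no partition")
```

The number of partitions is therefore an upper bound on the useful parallelism within one consumer group.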
Keyed messages
#partitions=3
hash(key) % #partitions
[Figure: a producer sends keyed messages (Message1 to Message18) to Topic1, whose three partitions live on BrokerId=1, BrokerId=2 and BrokerId=3. The key decides which partition each message lands in.]
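The rule hash(key) % #partitions can be tried out with a small sketch. It uses CRC32 from Python's zlib module as a stand-in hash, which is an assumption of this example and not necessarily the hash Kafka's default partitioner uses. The session keys are made up.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """hash(key) % #partitions, with CRC32 as a stable stand-in hash.

    Python's built-in hash() is randomized per process for bytes and str,
    so it would not give the same partition across producer restarts.
    """
    return zlib.crc32(key) % num_partitions

NUM_PARTITIONS = 3
keys = [b"session-1", b"session-2", b"session-1", b"session-3", b"session-2"]

placement = [(key, partition_for(key, NUM_PARTITIONS)) for key in keys]
for key, partition in placement:
    print(key.decode(), "-> partition", partition)
```

All messages with the same key end up in the same partition, so their relative order is preserved for whoever consumes that partition. Changing the number of partitions changes the mapping.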
Intra cluster replication
Replication factor = 3
Follower fails:
• Follower dropped from ISR
• When follower comes online again: fetch data from leader, then ISR gets updated
Leader fails:
• Detected via Zookeeper from ISR
• New leader gets elected
[Figure: the producer sends to Broker1, the Topic1 leader. Broker2 and Broker3 are Topic1 followers and, together with the leader, form the InSyncReplicas. Message1 to Message8 are present on all three brokers; Message9 and Message10 are being replicated to the followers, with acks flowing back.]
3 commit modes:
Commit mode       Latency       Durability
Fire & Forget     “none”        Weak
Leader ack        1 roundtrip   Medium
Full replication  2 roundtrips  Strong
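The difference between the commit modes shows up when the leader fails. The following toy model is a sketch under simplifying assumptions (one partition, in-memory logs, synchronous calls); the class and the mode names are hypothetical and are not Kafka APIs.

```python
class ReplicatedPartition:
    """Toy model of one partition: a leader log plus follower logs in the ISR."""

    def __init__(self, leader, followers):
        self.leader = leader
        self.isr = [leader] + list(followers)   # in-sync replicas, leader first
        self.logs = {broker: [] for broker in self.isr}

    def produce(self, message, mode):
        """Append to the leader and return the brokers the producer waited for."""
        self.logs[self.leader].append(message)
        if mode == "fire_and_forget":
            return []                            # no ack at all
        if mode == "leader_ack":
            return [self.leader]                 # one roundtrip
        if mode == "full_replication":
            self.replicate()                     # leader waits for the ISR
            return list(self.isr)
        raise ValueError(f"unknown commit mode: {mode}")

    def replicate(self):
        """Followers in the ISR fetch whatever they are missing from the leader."""
        for follower in self.isr[1:]:
            self.logs[follower] = list(self.logs[self.leader])

    def fail(self, broker):
        """Drop a broker from the ISR; elect a new leader if it was the leader."""
        self.isr.remove(broker)
        if not self.isr:
            raise RuntimeError("no in-sync replica left")
        if broker == self.leader:
            self.leader = self.isr[0]            # new leader comes from the ISR

partition = ReplicatedPartition("broker1", ["broker2", "broker3"])
partition.produce("Message1", "full_replication")
partition.produce("Message2", "leader_ack")     # followers have not fetched it yet
partition.fail("broker1")                       # leader dies before replication
print(partition.leader, partition.logs[partition.leader])
```

In this run Message1 survives the leader failure because it was acknowledged by the whole ISR, while Message2 was only acknowledged by the failed leader and is gone. That is the latency versus durability tradeoff in the table.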
Producer API
…or for log aggregation:
Configuration parameters:
ProducerType (sync/async)
CompressionCodec (none/snappy/gzip)
BatchSize
EnqueueSize/Time
Encoder/Serializer
Partitioner
#Retries
MaxMessageSize
…
Consumer API(s)
• High-level (consumer group, auto-commit)
• Low-level (simple consumer, manual commit)
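The practical difference between auto-commit and manual commit is what happens after a restart. The sketch below models a topic as an in-memory list and lets each consumer track its own offsets. It is an illustration only; the classes are hypothetical and do not mirror the real consumer APIs.

```python
class TopicLog:
    """Append-only list of messages; a message's index is its offset."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1

    def read(self, offset, max_messages):
        return self.messages[offset:offset + max_messages]

class Consumer:
    """Pulls from the log and keeps track of its own position."""

    def __init__(self, log, auto_commit):
        self.log = log
        self.auto_commit = auto_commit
        self.position = 0    # next offset to fetch
        self.committed = 0   # offset to resume from after a restart

    def poll(self, max_messages=2):
        batch = self.log.read(self.position, max_messages)
        self.position += len(batch)
        if self.auto_commit:
            self.committed = self.position
        return batch

    def commit(self):
        self.committed = self.position

    def restart(self):
        """Simulate a crash: resume from the last committed offset."""
        self.position = self.committed

log = TopicLog()
for i in range(1, 6):
    log.append(f"Message{i}")

manual = Consumer(log, auto_commit=False)
first = manual.poll()       # Message1, Message2
manual.commit()             # processing finished, commit explicitly
manual.poll()               # Message3, Message4 fetched but never committed
manual.restart()
replayed = manual.poll()    # the uncommitted messages are delivered again
print(first, replayed)
```

Because the log itself is untouched by reads, a consumer can also replay older messages simply by moving its position back.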
Broker Protips
• Reasonable number of partitions – will affect performance
• Reasonable number of topics – will affect performance
• Performance decreases with larger Zookeeper ensembles
• Disk flush rate settings
• message.max.bytes – max accept size, should be smaller than the heap
• socket.request.max.bytes – max fetch size, should be smaller than the heap
• log.retention.bytes – don’t want to run out of disk space…
• Keep Zookeeper logs under control for same reason as above
• Kafka brokers have been tested on Linux and Solaris
Operating Kafka
• Zookeeper usage
– Producer loadbalancing
– Broker ISR
– Consumer tracking
• Monitoring
– JMX
– Audit trail/console in the making
Distribution Tools:
• Controlled shutdown tool
• Preferred replica leader election tool
• List topic tool
• Create topic tool
• Add partition tool
• Reassign partitions tool
• MirrorMaker
Multi-datacenter replication
Ecosystem
Producers:
• Java (in standard dist)
• Scala (in standard dist)
• Log4j (in standard dist)
• Logback: logback-kafka
• Udp-kafka-bridge
• Python: kafka-python
• Python: pykafka
• Python: samsa
• Python: pykafkap
• Python: brod
• Go: Sarama
• Go: kafka.go
• C: librdkafka
• C/C++: libkafka
• Clojure: clj-kafka
• Clojure: kafka-clj
• Ruby: Poseidon
• Ruby: kafka-rb
• Ruby: em-kafka
• PHP: kafka-php(1)
• PHP: kafka-php(2)
• PHP: log4php
• Node.js: Prozess
• Node.js: node-kafka
• Node.js: franz-kafka
• Erlang: erlkafka
Consumers:
• Java (in standard dist)
• Scala (in standard dist)
• Python: kafka-python
• Python: samsa
• Python: brod
• Go: Sarama
• Go: nuance
• Go: kafka.go
• C/C++: libkafka
• Clojure: clj-kafka
• Clojure: kafka-clj
• Ruby: Poseidon
• Ruby: kafka-rb
• Ruby: Kafkaesque
• Jruby::Kafka
• PHP: kafka-php(1)
• PHP: kafka-php(2)
• Node.js: Prozess
• Node.js: node-kafka
• Node.js: franz-kafka
• Erlang: erlkafka
• Erlang: kafka-erlang
Common integration points:
Stream Processing
• Storm - A stream-processing framework.
• Samza - A YARN-based stream processing framework.
Hadoop Integration
• Camus - LinkedIn's Kafka=>HDFS pipeline. This one is used for all data at LinkedIn, and works great.
• Kafka Hadoop Loader - A different take on Hadoop loading functionality from what is included in the main distribution.
AWS Integration
• Automated AWS deployment
• Kafka->S3 Mirroring
Logging
• klogd - A python syslog publisher
• klogd2 - A java syslog publisher
• Tail2Kafka - A simple log tailing utility
• Fluentd plugin - Integration with Fluentd
• Flume Kafka Plugin - Integration with Flume
• Remote log viewer
• LogStash integration - Integration with LogStash and Fluentd
• Official logstash integration
Metrics
• Mozilla Metrics Service - A Kafka and Protocol Buffers based metrics and logging system
• Ganglia Integration
Packaging and Deployment
• RPM packaging
• Debian packaging: https://github.com/tomdz/kafka-deb-packaging
• Puppet integration
• Dropwizard packaging
Misc.
• Kafka Mirror - An alternative to the built-in mirroring tool
• Ruby Demo App
• Apache Camel Integration
• Infobright integration
What’s in the future?
• Topic and transient consumer garbage collection (KAFKA-560/KAFKA-559)
• Producer side persistence (KAFKA-156/KAFKA-789)
• Exact mirroring (KAFKA-658)
• Quotas (KAFKA-656)
• YARN integration (KAFKA-949)
• RESTful proxy (KAFKA-639)
• New build system? (KAFKA-855)
• More tooling (Console, Audit trail) (KAFKA-266/KAFKA-260)
• Client API rewrite (Proposal)
• Application level security (Proposal)
Stream processing
Kafka as a processing pipeline backbone
[Figure: producers publish to Kafka topic1; instances of Process1 and Process2 sit around Kafka topic1 and Kafka topic2 as processing stages, and the results reach System1 and System2.]
What is Storm?
• Distributed real-time computation system with design goals:
– Guaranteed processing
– No orphaned tasks
– Horizontally scalable
– Fault tolerant
– Fast
• Use cases: Stream processing, DRPC, Continuous computation
• 4 basic concepts: streams, spouts, bolts, topologies
• In Apache incubator
• Implemented in Clojure
Streams
an [infinite] sequence (of tuples)
(timestamp,sessionid,exception stacktrace)
(t4,s2,e2)
(t3,s3)
(t2,s1,e2)
(t1,s1,e1)
Spouts
a source of streams
(t4,s2,e2)
(t3,s3)
(t2,s1,e2)
(t1,s1,e1)
Connects to queues, logs, API calls, event data.
Some features, like transactional topologies (which give exactly-once messaging semantics), are only possible using the Kafka-TransactionalSpout-consumer.
Bolts
[Figure: a bolt receiving tuples such as (t5,s4) and (t4,s2,e2)]
• Filters
• Transformations
• Apply functions
• Aggregations
• Access DB, APIs etc.
• Emitting new streams
• Trident = a high level abstraction on top of Storm
Topologies
[Figure: a topology, i.e. a graph of spouts and bolts passing tuples such as (t5,s4) and (t4,s2,e2)]
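The four concepts can be mimicked with plain Python generators to show how a wordcount fits together. This is a single-process sketch with made-up function names, not the Storm API, and it has none of Storm's parallelism or fault tolerance.

```python
from collections import Counter

def sentence_spout(sentences):
    """Spout: a source of tuples (finite here; a real stream is unbounded)."""
    for sentence in sentences:
        yield (sentence,)

def split_bolt(stream):
    """Bolt: transforms each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)

def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) tuples."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])

def wordcount_topology(sentences):
    """Topology: spout -> split bolt -> count bolt, wired as a pipeline."""
    return count_bolt(split_bolt(sentence_spout(sentences)))

final_counts = {}
for word, count in wordcount_topology(
    ["kafka feeds storm", "storm counts words", "kafka scales"]
):
    final_counts[word] = count   # the latest emitted count wins
print(final_counts)
```

In a Kafka-Storm setup the spout would read from a Kafka topic instead of a Python list.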
Storm cluster
[Figure: a topology is deployed to Nimbus, which works with the Supervisor nodes through Zookeeper; Mesos/YARN also appears in the diagram.]
Compare with Hadoop: Nimbus corresponds to the JobTracker, Supervisors to the TaskTrackers
Links
Apache Kafka:
Papers and presentations
Main project page
Small Mediawiki case study
Storm:
Introductory article
Realtime discussing blog post
Kafka+Storm for realtime
BigData Trifecta blog post: Kafka+Storm+Cassandra
IBM developer article
Kafka+Storm@Twitter
BigData Quadfecta blog post