⚠️ The Hidden Trap in ClickHouse Streaming: Why Your Real-Time Analytics Might Be Completely Wrong

ClickHouse adoption is growing rapidly, and for good reason: blazing-fast queries, columnar architecture, massive dataset processing 🚀

But there's a critical issue:
❗ If your streaming pipeline isn't designed correctly, your data gets silently corrupted and your dashboards show wrong numbers, without any errors or warnings.

🧩 The Common Pattern
Many teams build pipelines like this (a minimal SQL sketch of this pattern appears at the end of this description):
Kafka → ReplacingMergeTree → Materialized View → Aggregation Tables
It looks logical: deduplication, aggregation, all automated. But this is exactly where the problem hides.

🧠 The Root Cause
1️⃣ ReplacingMergeTree doesn't deduplicate on insert
• Deduplication happens only during background merges
• Duplicates exist for a while (sometimes a long while)
2️⃣ Materialized views execute on the raw data, before deduplication happens
• Result: a duplicate arrives → the view fires → the aggregation updates → the source deduplicates later
• But the aggregated stats? Corrupted forever ❌
3️⃣ There is no automatic fix
• Once the aggregates are wrong, they stay wrong

🌍 When Does This Happen?
More often than you think:
• Network failures
• Kafka rebalancing
• Consumer restarts
• At-least-once delivery (Kafka's default)
• Backfills and testing mistakes
The result: wrong revenue, user counts, and conversion rates, with no errors in the logs. Just silent corruption 🚨

🛠️ Solutions
✅ Prevent duplicates from entering the pipeline
✅ Don't rely solely on ClickHouse deduplication
✅ Design idempotent summary tables (see the sketch at the end of this description)
✅ Remember that FINAL is not production-ready (too expensive)
✅ Use real streaming engines for critical systems

Flink, RisingWave, and Materialize provide:
• Exactly-once semantics
• Proper updates and retracts
• True stream-level deduplication
ClickHouse then becomes the serving layer, where it shines ⚡

🏗️ Mature Architecture
Kafka → Streaming Engine (correct processing) → ClickHouse (fast queries)

🎥 Hands-On Workshop
Watch me demonstrate this problem live:
• A healthy pipeline → duplicate data arrives → silent corruption
• Why FINAL shows different numbers
• How to fix the architecture

Includes:
• Complete setup (Redpanda, ClickHouse, Python)
• Live corruption demonstration
• Verification scripts (a simple example query is sketched below)
• All source code and configs
• Solutions and best practices

💡 Who Should Watch:
• Data engineers running streaming pipelines
• ClickHouse users doing real-time analytics
• Teams facing data reliability issues

🔗 Resources:
Code: https://github.com/sepahram-school/wo...

📌 Key Takeaways:
• ReplacingMergeTree doesn't prevent duplicate inserts
• Materialized views fire before deduplication
• Aggregations can be permanently wrong
• For critical real-time work, use proper streaming engines

#ClickHouse #DataEngineering #StreamProcessing #RealTimeAnalytics #Kafka #datareliability

------------------------------------------------------------------------------

In this video we show why, in real-time analytics systems built on ClickHouse, statistics and metrics can go wrong completely silently if the streaming architecture is not designed correctly. The problem starts with the fact that duplicate records are not removed immediately when data arrives, and materialized views run on the raw data; as a result, if even a single duplicate event enters the system, the aggregate calculations are updated more than once at that moment, and the error stays in the statistics forever, without any error or warning being recorded. In the video you see this problem hands-on, and we also review the general solutions for fixing it.
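Below is a minimal SQL sketch of the fragile pattern described above. The table and column names (events, event_id, amount, revenue_daily) are assumptions for illustration only, not the workshop's actual schema:

```sql
-- Raw events; duplicates are removed only during background merges
CREATE TABLE events
(
    event_id String,
    user_id  String,
    amount   Decimal(18, 2),
    ts       DateTime
)
ENGINE = ReplacingMergeTree
ORDER BY event_id;

-- Aggregation target
CREATE TABLE revenue_daily
(
    day     Date,
    revenue Decimal(18, 2)
)
ENGINE = SummingMergeTree
ORDER BY day;

-- The materialized view fires on every INSERT into events,
-- i.e. on the raw, not-yet-deduplicated rows
CREATE MATERIALIZED VIEW revenue_daily_mv TO revenue_daily AS
SELECT toDate(ts) AS day, sum(amount) AS revenue
FROM events
GROUP BY day;

-- Simulate an at-least-once redelivery: the same event arrives twice
INSERT INTO events VALUES ('evt-1', 'u-1', 100.00, now());
INSERT INTO events VALUES ('evt-1', 'u-1', 100.00, now());

-- The source looks correct once collapsed...
SELECT count() FROM events FINAL;        -- 1
-- ...but the aggregate already counted the duplicate, permanently
SELECT sum(revenue) FROM revenue_daily;  -- 200.00
```

The duplicate is eventually merged away in events, but nothing ever corrects revenue_daily, which is exactly the silent corruption shown in the video.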
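A quick way to check whether a pipeline like this is currently exposed is to look for rows that have not yet been collapsed. This is a rough verification sketch against the same assumed events table, not the workshop's verification scripts:

```sql
-- Spot not-yet-collapsed duplicates in the raw table
SELECT event_id, count() AS copies
FROM events
GROUP BY event_id
HAVING copies > 1;

-- Compare raw vs. deduplicated row counts; any gap means the
-- materialized view has already double-counted those rows
SELECT
    (SELECT count() FROM events)       AS raw_rows,
    (SELECT count() FROM events FINAL) AS deduplicated_rows;
```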
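One direction for the "idempotent summary tables" idea mentioned above is to key the summary on the same deduplication key as the source, so a redelivered event replaces its earlier copy instead of adding to a running total. A rough sketch with the same assumed names; note that it trades write-time aggregation for query-time cost, so the more robust fix for critical systems remains deduplicating upstream in a streaming engine:

```sql
-- Summary keyed by the dedup key: a duplicate overwrites, never adds
CREATE TABLE revenue_by_event
(
    event_id String,
    day      Date,
    amount   Decimal(18, 2)
)
ENGINE = ReplacingMergeTree
ORDER BY event_id;

CREATE MATERIALIZED VIEW revenue_by_event_mv TO revenue_by_event AS
SELECT event_id, toDate(ts) AS day, amount
FROM events;

-- Aggregate at query time; duplicates that have not merged away yet
-- still collapse here, so the total stays correct
SELECT day, sum(amount) AS revenue
FROM revenue_by_event FINAL
GROUP BY day;
```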