Developing for big data systems is like driving a freight truck full of eggs on ice. It is different from driving a sedan, as in enterprise systems, or a race car, as in trading, or the submarine-rocket-airplane that machine learning and other voodoo is. The freight truck has a lot of inertia, the ice is thin, and the eggs are fragile.
The truck
My experience with interviewing for data processing/ETL jobs is that it focuses on certain tricks of the trade that are well known and expected by both sides. Classic interview questions try to assess knowledge of algorithms and then develop into the classic scenarios: the data does not fit into memory, or a system goes down. It is not only interviewing; most of the art is focused on these problems of limited memory, system partitioning, and operating in an unreliable environment. This leads to the expectation that data engineers spend most of their time thinking about the scale of the problem. We like to measure success in the PBs in the data warehouse and the TB/day of throughput. However, solutions to these sorts of problems are part of the engineering lore, and we know how to design for them.
The day-to-day thought sink is not the scale, it's the inertia of the system. The only way to deal with the inertia is compatibility: making sure that nothing disrupts the delicate movement of the truck on ice. Since it is expensive to stop and start a production system, we need to be able to take down and bring up pieces of the system while keeping the integrity of the whole.
The questions I find myself asking when coding are:
- if we write data in a new way, will the old code be able to parse it and continue operating? (forward compatibility)
- will the new code be able to understand the old data? (backward compatibility)
- if half of the system is running the new code and the other half the old code, will anything break?
The last one is very easy to forget about. With modern deployment tools and operational maturity it's easy to regard deployment as atomic - bam, and the system is running the new code. But there is always a window of time when the system is in an intermediate state and things can go wrong, and it is essential to keep these scenarios in mind.
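The three questions above can be checked mechanically before a release. What follows is only a minimal sketch under stated assumptions: the JSON record layout, the field names, and the `parse_v1`/`parse_v2` readers are hypothetical and not taken from any real system. It shows one common pattern, where the old reader ignores fields it does not know and the new reader supplies a default for fields that old data lacks, so that both readers survive both kinds of record during a partial rollout.

```python
import json

# Hypothetical record layouts, for illustration only:
#   old writer: {"user": "ann", "amount": 5}
#   new writer: {"user": "bob", "amount": 7, "currency": "EUR"}


def parse_v1(line):
    """Old reader: picks the fields it knows and ignores anything extra."""
    rec = json.loads(line)
    return {"user": rec["user"], "amount": rec["amount"]}


def parse_v2(line):
    """New reader: falls back to a default when old data lacks the new field."""
    rec = json.loads(line)
    return {
        "user": rec["user"],
        "amount": rec["amount"],
        "currency": rec.get("currency", "USD"),  # assumed default
    }


old_line = json.dumps({"user": "ann", "amount": 5})
new_line = json.dumps({"user": "bob", "amount": 7, "currency": "EUR"})

# Mid-deployment, either reader may meet either kind of record,
# so exercise the full matrix rather than only the "matching" pairs.
for reader in (parse_v1, parse_v2):
    for line in (old_line, new_line):
        print(reader.__name__, reader(line))
```

Running the full reader-by-record matrix in a test is cheap, and it is the closest a unit test gets to the half-old, half-new state of a real deployment.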
Keeping these and other scenarios in mind is mentally challenging and leads to paranoia or, worse, a fall into the sin of sloppiness. As usual in software development, good design helps: the more decoupled the subsystems are and the stronger the APIs on their edges, the easier it is to concentrate on the system under development and assure its viability. It also helps to roll out pieces one by one and test at the intermediate steps, keeping track of the errors and issues that come up. Good development procedures, processes, QA and operations support help too, but they never replace rigorous design and thought.
Another helpful thing is the brakes. Any feature should have a way to roll back its effects, either through configuration or system settings. The other piece is a method of reprocessing if corruption occurs, which should be tested and ready before the deployment.
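One simple form of such a rollback is a configuration switch that defaults to the old, known-good code path. The sketch below is illustrative only: the `ENABLE_NEW_DEDUP` variable and the two dedup functions are hypothetical names, and a real system would more likely read the flag from its own configuration service than from the environment.

```python
import os


def new_dedup_enabled(env=os.environ):
    """Read a hypothetical kill switch; anything but "true" keeps the old path."""
    return env.get("ENABLE_NEW_DEDUP", "false").lower() == "true"


def dedup_old(rows):
    """Known-good behaviour: drop exact duplicates, keep first occurrence."""
    seen, out = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out


def dedup_new(rows):
    """Behaviour being rolled out: duplicates are matched case-insensitively."""
    seen, out = set(), []
    for row in rows:
        key = row.lower()
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


def dedup(rows, env=os.environ):
    """Dispatch on the flag so a rollback is a config change, not a redeploy."""
    return dedup_new(rows) if new_dedup_enabled(env) else dedup_old(rows)
```

Note that the switch only stops new damage. Data already written by the new path still needs the reprocessing method mentioned above.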
The ice
I often forget that the cloud is not made of water vapor, but is actual hardware made of silicon and metal, connected to power grids, cooling systems and network gear, living in a cage under a roof. Map slots are not abstract numbers on the Hadoop job tracker. Hardware breaks; machines are taken out and brought back to the grid. While in general it is safe to assume that the environment is consistent and uniform, and to leave the heavy lifting to the underlying systems, it helps to have the ops processes in mind when debugging. I had a case where a machine came back into the rack with an older version of the software, which showed up as occasional failures on some of the mappers. It helps to keep track of the machine names where failures occur and to keep in touch with the operations crew - they live the zeitgeist of the platform.
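Keeping track of machine names does not need much tooling. As a rough sketch, assuming task attempts can be exported as (hostname, succeeded) pairs, which is an assumed shape and not any particular scheduler's log format, a few lines are enough to rank hosts by failure rate:

```python
from collections import Counter


def failures_by_host(attempts):
    """Rank hosts by failure rate.

    attempts: iterable of (hostname, succeeded) pairs (assumed format).
    Returns (hostname, failed, total) tuples, worst failure rate first,
    listing only hosts with at least one failure.
    """
    attempts = list(attempts)  # iterated twice below
    total = Counter(host for host, _ in attempts)
    failed = Counter(host for host, ok in attempts if not ok)
    return sorted(
        ((host, failed[host], total[host]) for host in failed),
        key=lambda t: t[1] / t[2],
        reverse=True,
    )


# Made-up sample data for illustration.
sample = [
    ("node03", True), ("node03", True), ("node03", False),
    ("node03", True), ("node03", True),
    ("node17", False), ("node17", False), ("node17", True), ("node17", False),
    ("node21", True),
]

for host, bad, seen in failures_by_host(sample):
    print(f"{host}: {bad}/{seen} attempts failed")
```

A host that fails far more often than its neighbours is a good reason to ask the operations crew what happened to that machine recently.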
The eggs
The data, being fragile and perishable, are the metaphorical eggs. In a large-scale data system there will always be some data rows that break the system, and finding these rows is a painful exercise. The key here is instrumentation and defensive programming. Ideally, every field read from the database should be sanitized and checked, and every non-conforming bit should be logged and accounted for. However, this adds significant overhead in production, so it's always good to have multiple run modes for the job with varying levels of verbosity.
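A minimal sketch of this idea follows. The row shape (`user_id`, `amount`), the validation rules and the reason labels are all assumptions made up for illustration. The cheap mode only counts rejected rows per reason; the verbose mode also logs each offending row, which is what you want when hunting for the one row that breaks the job.

```python
import logging
from collections import Counter

log = logging.getLogger("etl")


def clean_rows(rows, verbose=False):
    """Validate rows defensively.

    Returns (good, rejects): good is a list of (user_id, amount) pairs,
    rejects is a Counter of rejection reasons. Rejected rows are always
    counted; they are logged one by one only when verbose=True.
    """
    good, rejects = [], Counter()
    for i, row in enumerate(rows):
        reason = None
        user = row.get("user_id")
        if not isinstance(user, str) or not user.strip():
            reason = "bad_user_id"
        else:
            try:
                amount = float(row.get("amount"))
            except (TypeError, ValueError):
                amount = None
            # Rejects missing, unparseable, negative, NaN and infinite amounts.
            if amount is None or not (0 <= amount < float("inf")):
                reason = "bad_amount"
        if reason is None:
            good.append((user.strip(), amount))
        else:
            rejects[reason] += 1
            if verbose:
                log.warning("row %d rejected (%s): %r", i, reason, row)
    return good, rejects
```

Comparing the reject counters between runs is a cheap health signal in itself: a sudden jump in one reason usually points at an upstream change.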
Another problem with data is its persistent nature - once it's corrupted, it's hard to reverse. In most cases reversal is either expensive, requiring long reprocessing, or outright impossible. The usual approach of taking backups does not apply: where exactly are you going to back up petabytes of data to? The practice is to deploy changes gradually, making sure that the data stays consistent. The most important tool in this process is diffing, which lets you compare data across multiple locations.
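The core of such a diff can be sketched in a few lines. This is an in-memory illustration only, assuming each row is a flat dict with a unique `id` field (both assumptions); at real scale the same idea would run as a distributed job that joins per-key digests from the two locations.

```python
import hashlib


def row_digest(row):
    """Stable digest of a flat row, independent of field order."""
    canonical = "\x1f".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_datasets(left, right, key="id"):
    """Compare two datasets by key; report missing, extra and changed rows."""
    l_index = {row[key]: row_digest(row) for row in left}
    r_index = {row[key]: row_digest(row) for row in right}
    common = l_index.keys() & r_index.keys()
    return {
        "only_left": sorted(l_index.keys() - r_index.keys()),
        "only_right": sorted(r_index.keys() - l_index.keys()),
        "changed": sorted(k for k in common if l_index[k] != r_index[k]),
    }


# Made-up data: output of the old job vs. the new job on the same input.
old_output = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
new_output = [{"v": "a", "id": 1}, {"id": 2, "v": "B"}, {"id": 4, "v": "d"}]

print(diff_datasets(old_output, new_output))
```

Comparing digests rather than full rows keeps the amount of data shuffled between locations small, at the cost of not showing which field changed.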
The egg truck on ice
The most important thing is to have fun driving.