Dr. Tom Redman is known to have said, “It is clear to me that data is becoming the key asset of our times. Yet most data is in pretty bad shape, most companies are not very good at putting their data to work and their organizations are singularly ill-suited for data.”
Fortunately, every day that goes by sees improvements in how organizations are able to identify and evaluate the data available to them. Processing capabilities are ever changing in response to the vast and accelerating quantities of data. Still, a few key practices should be considered foundational when building and optimizing big data pipelines.
1. Maintain Database Health
First, the saying “garbage in, garbage out” is never more true than when organizations are making data-driven decisions. Database health can be addressed in several ways, but it is important to remember that failures upstream of the pipeline will result in old or incorrect data in initial dependencies and will lead to decisions based on inaccurate information. As data pipelines are built, data quality checks should be designed. Separate ETL tasks can be written to automatically perform checks and raise errors if the data doesn’t meet expected standards.
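As a rough illustration of such a check, here is a minimal Python sketch of a standalone quality task. The names DataQualityError and check_batch, the loaded_at timestamp field, and the idea of rows as dictionaries are all assumptions made for this example, not part of any particular ETL tool.

```python
from datetime import datetime, timedelta, timezone


class DataQualityError(Exception):
    """Raised when a batch fails a quality check (hypothetical name)."""


def check_batch(rows, required_fields, max_age, timestamp_field="loaded_at", now=None):
    """Validate a batch of dict rows and return it unchanged if it passes.

    Checks that the batch is not empty, that every required field is
    present, and that the newest row is not older than max_age.
    """
    now = now or datetime.now(timezone.utc)
    if not rows:
        raise DataQualityError("batch is empty")

    # Completeness: every row needs the required fields and a timestamp.
    fields = list(required_fields) + [timestamp_field]
    for index, row in enumerate(rows):
        missing = [name for name in fields if row.get(name) is None]
        if missing:
            raise DataQualityError(f"row {index} is missing {missing}")

    # Freshness: stale upstream data should stop the pipeline early.
    newest = max(row[timestamp_field] for row in rows)
    if now - newest > max_age:
        raise DataQualityError(f"newest row is stale: {newest.isoformat()}")
    return rows
```

A task like this could run right after extraction, so that a stale or incomplete batch stops the pipeline before any downstream step consumes it.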
2. Keep it Simple
The next critical step in building a robust data pipeline is to keep the logic as simple as you can. They don’t call it the KISS principle out of simple affection. The best way to balance pipeline development needs with simplicity is to break large and complex transformations into discrete, easy-to-digest functions with specific purposes and descriptive names. An intuitive naming convention provides context not only for the original developer, but also for any other team members who are, or eventually will be, part of the project. Using these discrete and clearly identified functions creates staged pipelines that are more maintainable, because more people can understand and adjust the pipeline and each change is less complex.
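To make the idea concrete, here is a small Python sketch of a staged pipeline built from single-purpose functions. The field names (customer_id, email, quantity, unit_price) and the function names are hypothetical and chosen only for illustration.

```python
def drop_rows_without_customer_id(rows):
    """Keep only rows that can be tied to a customer."""
    return [row for row in rows if row.get("customer_id")]


def normalize_email_case(rows):
    """Trim whitespace and lowercase the email field."""
    return [{**row, "email": row["email"].strip().lower()} for row in rows]


def add_order_total(rows):
    """Add a total field computed from quantity and unit price."""
    return [{**row, "total": row["quantity"] * row["unit_price"]} for row in rows]


# The pipeline is an ordered list of stages, so it reads like a checklist.
ORDER_STAGES = [
    drop_rows_without_customer_id,
    normalize_email_case,
    add_order_total,
]


def run_pipeline(rows, stages):
    """Apply each stage in order, passing the output of one to the next."""
    for stage in stages:
        rows = stage(rows)
    return rows
```

Because each stage does one thing and says so in its name, a reviewer can follow the pipeline from the list of stages alone, and a change to one step is unlikely to affect the others.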
3. Recycle and Reuse
Third, don’t forget that it is possible to reuse queries and functions to support different use cases. If you build discrete transformations with descriptive names, you will be able to save yourself future work by repurposing the tasks across multiple pipelines. While perspective may change, the questions and information, the sources and targets, will often be similar enough that re-coding the functionality every time is redundant.
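The following Python sketch shows one way this reuse might look: two general-purpose transformations shared by two otherwise unrelated pipelines. The pipelines (orders and support tickets), the field names, and the date range are all invented for the example.

```python
from datetime import date
from functools import partial


def keep_rows_in_date_range(rows, start, end, date_field="created_on"):
    """Keep rows whose date field falls within start and end, inclusive."""
    return [row for row in rows if start <= row[date_field] <= end]


def count_by(rows, key):
    """Count rows per distinct value of the given key."""
    counts = {}
    for row in rows:
        counts[row[key]] = counts.get(row[key], 0) + 1
    return counts


# One configured filter, reused by both pipelines below.
january_2024 = partial(
    keep_rows_in_date_range, start=date(2024, 1, 1), end=date(2024, 1, 31)
)


def orders_per_region(orders):
    """Sales pipeline: January orders counted by region."""
    return count_by(january_2024(orders), "region")


def tickets_per_priority(tickets):
    """Support pipeline: January tickets counted by priority."""
    return count_by(january_2024(tickets, date_field="opened_on"), "priority")
```

The sources and the questions differ, but neither pipeline needed its own filtering or counting logic.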
4. Language Matters
One key differentiating feature of current pipeline building tools is the complexity of coding required. When pipelines are built with a coding language that enables portability, the entire chain of users can be empowered. Data professionals, including data scientists, data engineers, data analysts, database administrators, business analysts, and analytics engineers, will all be able to understand the structure. Custom or more advanced coding languages that require specialized expertise can be expensive, and it can be difficult to find talent to build, review, debug, and maintain the pipeline functionality. Leveraging widely used languages like Python and Java reduces the dependence on specialized roles, which makes it easier to divide support responsibilities equitably across the team.
5. Backups are Mandatory
Lastly, always, always, always, always create backups of your data pipelines. There are tools available on the market that will let you economically create multiple versions of a pipeline to process the same data. Then, if something unexpected happens with the primary pipeline, a replica of that pipeline can be used to read and evaluate the production data. When data quality is a priority, the design is simple, and the structure of the pipeline is understood by the whole team, unexpected outcomes can be minimized. However, we all know that there will be errors, so ensuring that backups are available will prevent the loss or interruption of the organization’s ability to make responsible decisions based on actual information. According to Geoffrey Moore, “Without big data, you are blind and deaf and in the middle of a freeway.”
By keeping these five tips in mind, you can keep your organization from getting flattened by a bus or missing the on-ramp to success.
For more tips on how to overcome the challenges of creating effective data pipelines, check out our white paper, The 3 Common Challenges of ETL.