Many companies suffer when their current data architecture and data team have reached their scale limits. Most struggle with scaling their data analytics operations because they insist on applying the same tools to every job, even to jobs that don't need those tools or don't respond well to them. For example, a solution where ETL pipelines pump data into a data lake has a finite capacity to deliver value. In many ways, such a system is monolithic and does not scale well.

Growing delays when adding new ETL pipelines, generating new reports, and running ad hoc queries can lead to a situation where consumers simply resort to their own sources of data – something known as "shadow data analytics". While people need data to do their jobs, this practice bypasses any sort of data governance or data security and tends to result in erroneous use of data. When faced with shadow data analytics, one approach is to mandate that all staff use the sanctioned data analytics system. Mandates might be effective at first; over time, however, they exact a toll on employee autonomy and initiative.

A data lake excels at ad hoc queries and computationally intensive operations, but the centralized nature of the lake can make it hard to include pipelines from every domain in the company. A data mesh, on the other hand, can be built to include an almost arbitrary number of domains; the drawback is that very computationally intensive operations can be time-consuming. In a data mesh, domain teams maintain ownership of their data and create data products that expose that data to the rest of the company.

A data mesh requires extra work by domain teams. If an engineering team handles the data mesh work, its capacity for other engineering work will decrease, at least at the beginning. However, the alternative is often a tightly coupled data pipeline, which is inherently fragile: data changes at the application level can result in erroneous data being fed into the data lake and subsequent defects in the reports produced by data engineers. Troubleshooting these defects is time-consuming and frustrating, and when the source of a defect turns out to be something like a change in a field's type or in the way a particular field is used, there can be a lot of friction between the engineering and data teams.

The solution is to use the right architecture to solve the right problems and to recognize the value of your data engineers. If they don't feel valued, or if they feel their jobs are threatened by a data mesh, they will act against it, even though the two architectures are complementary.

Finally, any mechanism used for discoverability must be kept up to date to protect the usefulness of the data mesh; out-of-date documentation is often more damaging than no documentation at all. For this reason, we recommend that data meshes employ a docs-as-code scheme, where updating the data catalog is part of the code review checklist for every pull request. With each merged pull request, updated metadata enters the DevOps pipeline and automatically updates the data catalog. Depending on the catalog, it may be updated directly via an API, by pulling JSON files from an S3 bucket, or through other methods, as in the sketch below.
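To make the docs-as-code step concrete, here is a minimal sketch of a pipeline job that publishes a domain team's data product metadata after a pull request is merged. It assumes a catalog that ingests JSON files from an S3 bucket; the file location, required fields, and bucket name are all illustrative assumptions, not part of any specific catalog's API.

```python
"""Minimal docs-as-code sketch: run by the CI/CD pipeline after a merge.

All names below (metadata path, required fields, bucket) are
illustrative, not a specific data catalog's conventions.
"""
import json
import sys
from pathlib import Path

import boto3  # assumes a catalog that ingests JSON files from S3

# Hypothetical location of the data product metadata, versioned
# alongside the code that produces the data it describes.
METADATA_FILE = Path("data-product/metadata.json")
CATALOG_BUCKET = "example-data-catalog"  # illustrative bucket name
REQUIRED_FIELDS = {"name", "domain", "owner", "schema", "version"}


def validate(metadata: dict) -> None:
    """Fail the pipeline if required metadata fields are missing."""
    missing = REQUIRED_FIELDS - metadata.keys()
    if missing:
        sys.exit(f"metadata.json is missing required fields: {sorted(missing)}")


def publish(metadata: dict) -> None:
    """Write the metadata where the catalog picks it up (S3, here)."""
    key = f"catalog/{metadata['domain']}/{metadata['name']}.json"
    boto3.client("s3").put_object(
        Bucket=CATALOG_BUCKET,
        Key=key,
        Body=json.dumps(metadata).encode("utf-8"),
        ContentType="application/json",
    )


if __name__ == "__main__":
    metadata = json.loads(METADATA_FILE.read_text())
    validate(metadata)
    publish(metadata)
```

The same step could just as easily call a catalog's REST API instead of writing to S3. The point is that the metadata lives next to the code, gets checked during code review, and the pipeline fails loudly when it is missing or incomplete, so the catalog never silently drifts out of date.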