Data Engineer Interview
Data storage requirements
Depend on a variety of factors, including:
the type of data
the format of the data
the number of users accessing the data
the length of time the data will be retained
How much space is required to store 10 Million rows by 1000 columns in CSV? (Make assumptions if required)
Depends on 3 factors:
Data types
Integer data type: 4 bytes per value
Float data type: 8 bytes per value
Text data type: The space required will depend on the length of the text. As an estimate, we can assume an average of 50 bytes per value.
Datetime data type: 16 bytes per value
Compression
Indexing
Calculation: assuming a mix of types averaging 20 bytes per value, the total is 20 bytes * 1,000 columns * 10 million rows = 2 x 10^11 bytes, i.e. about 200 GB for CSV
If we store the data as Parquet and apply compression (a rough estimation sketch follows below):
Using Snappy compression: roughly 2-3x compression, hence about 65-100 GB
Using Gzip compression: roughly 10x compression, hence about 20 GB
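A minimal back-of-the-envelope sketch of the sizing above, using the bytes-per-value figure and compression ratios assumed in this answer (these are stated assumptions, not measured numbers):

```python
# Rough size estimate for a 10 million row x 1,000 column CSV.
ROWS = 10_000_000
COLS = 1_000
AVG_BYTES_PER_VALUE = 20  # assumed average over int/float/text/datetime values

raw_bytes = ROWS * COLS * AVG_BYTES_PER_VALUE
raw_gb = raw_bytes / 1e9  # decimal GB, good enough for an estimate

print(f"Raw CSV estimate: {raw_gb:,.0f} GB")                                # ~200 GB
print(f"Parquet + Snappy (2-3x): {raw_gb / 3:,.0f}-{raw_gb / 2:,.0f} GB")   # ~67-100 GB
print(f"Parquet + Gzip (~10x): {raw_gb / 10:,.0f} GB")                      # ~20 GB
```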
Storing 1000 rows by 10 Million columns versus 1000 columns by 10 Million rows: would there be any difference in the space required for storage?
It will be the same if no indexing is done while storing
How much space is required to store a 1 Million rows by 1000 columns CSV file as JSON?
The space required would be approximately 2x the CSV size
Keys, braces, newline characters, colons, commas, etc. contribute to the extra size
The assumption here is that the CSV file has dense data (a small size-comparison sketch follows)
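A small sketch, assuming pandas is available, that illustrates the roughly 2x inflation when the same dense table is written as JSON records instead of CSV; the table is scaled down so it runs quickly, since only the ratio matters:

```python
import numpy as np
import pandas as pd

# Hypothetical dense table: 10,000 rows x 10 numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1_000_000, size=(10_000, 10)),
                  columns=[f"col_{i}" for i in range(10)])

csv_bytes = len(df.to_csv(index=False).encode("utf-8"))
json_bytes = len(df.to_json(orient="records").encode("utf-8"))  # keys, braces, commas repeated per record

print(f"CSV size:  {csv_bytes:,} bytes")
print(f"JSON size: {json_bytes:,} bytes")
print(f"JSON/CSV ratio: {json_bytes / csv_bytes:.2f}x")
```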
How Compression works
Different compression techniques
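These two headings are left as notes; as a minimal illustration (assuming only the Python standard library), the sketch below compares a few general-purpose compression techniques on the same repetitive text, since repetition and structure are what compressors exploit:

```python
import bz2
import gzip
import lzma

# Repetitive, CSV-like sample data: highly compressible by design.
data = b"user_id,event,timestamp\n" + b"12345,click,2024-01-01T00:00:00\n" * 10_000

for name, compress in [("gzip", gzip.compress), ("bz2", bz2.compress), ("lzma", lzma.compress)]:
    compressed = compress(data)
    ratio = len(data) / len(compressed)
    print(f"{name:>4}: {len(data):,} -> {len(compressed):,} bytes ({ratio:.0f}x)")
```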
Data Quality Aspects
What are different aspects of data quality?
Accuracy: This refers to the correctness of data. Free from errors and reflects the true values.
Completeness: Extent to which all required data is available. Complete if all necessary fields are filled in and there are no missing values.
Consistency: Extent to which data is uniform and consistent across different systems, applications, or time periods. Consistent if it is free from contradictions or discrepancies.
Timeliness: Extent to which data is up-to-date and reflects the current state of affairs. Timely if it is available when it is needed and is not outdated.
Relevance: Extent to which data is useful and relevant to the intended purpose. Relevant if it is applicable to the problem at hand and is useful for making decisions or taking actions (simple programmatic checks for several of these dimensions are sketched after this list).
Validity: Extent to which data is valid and conforms to the rules and constraints defined for it. Valid if it meets the specified criteria and is consistent with the intended use.
Integrity: Extent to which data is complete, accurate, and consistent across different systems, applications, or time periods. Integral if it is protected against unauthorized changes or modifications.
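A minimal sketch, assuming pandas and a hypothetical orders table, of how a few of these dimensions translate into concrete checks (completeness, validity, uniqueness as a proxy for integrity, and timeliness):

```python
import pandas as pd

# Hypothetical orders data with a few deliberate quality problems.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x", "c@x.com"],
    "amount": [10.0, -5.0, 20.0, 30.0],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2023-06-01", "2024-01-03"]),
})

# Completeness: share of non-missing values per column.
completeness = 1 - orders.isna().mean()

# Validity: values must conform to defined rules (non-negative amount, plausible email).
valid_amount_pct = (orders["amount"] >= 0).mean()
valid_email_pct = orders["email"].str.contains(r"@.+\.", na=False).mean()

# Integrity/uniqueness: order_id should not repeat.
duplicate_ids = orders["order_id"].duplicated().sum()

# Timeliness: staleness of the most recent update relative to a reference date.
staleness_days = (pd.Timestamp("2024-01-05") - orders["updated_at"].max()).days

print(completeness, valid_amount_pct, valid_email_pct, duplicate_ids, staleness_days, sep="\n")
```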
What are common data quality issues?
Inaccurate or incorrect data: Occurs when data is entered or recorded incorrectly, such as misspelled names, incorrect dates or times, or erroneous values.
Missing data: This occurs when data is not recorded or entered properly, or when certain data points are not available or are incomplete.
Duplicate data: This occurs when the same data is entered or recorded multiple times, leading to inconsistencies and confusion.
Inconsistent data: This occurs when data is entered or recorded in different formats or with different units of measurement, making it difficult to compare or analyze.
Outdated data: This occurs when data is not updated or maintained regularly, leading to inaccuracies and inconsistencies over time.
Data integration issues: This occurs when data from different sources is not properly integrated or aligned, leading to inconsistencies and errors.
Data security and privacy issues: This occurs when sensitive data is not properly secured or when privacy laws and regulations are not followed.
How to identify and resolve data quality issues?
Define data quality standards: Define the quality standards for the data, including accuracy, completeness, consistency, timeliness, relevance, validity, and integrity.
Conduct data profiling: Conduct a thorough analysis of the data to identify potential issues, such as missing values, duplicates, inconsistencies, and outliers (a small profiling sketch follows this list).
Identify the root cause of data quality issues: Analyze the data to determine the root cause of quality issues, such as incorrect data entry, system errors, or poor data integration.
Develop a data quality plan: Develop a plan to address the identified data quality issues, including strategies for data cleaning, data enhancement, and data governance.
Implement data quality processes: Implement processes to ensure ongoing data quality, such as data validation, data monitoring, and data auditing.
Use data quality tools: Use data quality tools and technologies to automate data profiling, data cleaning, and data validation.
Monitor and evaluate data quality: Monitor and evaluate data quality regularly to ensure that data meets the defined quality standards and to identify any new quality issues.
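A minimal data-profiling sketch, assuming pandas; the customers table and its columns are hypothetical. It produces the kind of per-column summary (nulls, distinct values, basic stats) referenced in the profiling step above:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column profile: dtype, nulls, distinct values, plus min/max/mean for numeric columns."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "null_pct": df.isna().mean().round(3),
        "distinct": df.nunique(),
    })
    numeric = df.select_dtypes("number")
    summary["min"] = numeric.min()
    summary["max"] = numeric.max()
    summary["mean"] = numeric.mean()
    return summary

# Hypothetical usage on a small customers table with a duplicate row and a missing value.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age": [34, 29, 29, None],
    "country": ["IN", "US", "US", "DE"],
})
print(profile(customers))
print("duplicate rows:", customers.duplicated().sum())
```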
Data validation or data quality check while migration
Instead of migrating all the data first and only then checking its quality, it is better to check data quality for each batch of data. The data volume is typically huge, so moving everything and then checking quality is not the right approach.
Approaches for testing through data contract for quality
Count based approach
Data profiling based approach: count, max, min, mean, variance, unique value, null values, distinct values
Data classification based approach
Data type
Randomly sampled migrated data tested against original source data (a batch-level comparison sketch follows)
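A minimal sketch of a batch-level source-vs-target check during migration, assuming both sides of one batch are loaded into pandas DataFrames; it combines the count, profiling, and random-sampling approaches listed above:

```python
import pandas as pd

def validate_batch(source: pd.DataFrame, target: pd.DataFrame, sample_size: int = 100) -> dict:
    """Compare one migrated batch against its source: counts, basic profile, and a random sample."""
    checks = {
        # Count-based check
        "row_count_match": len(source) == len(target),
        # Profiling-based check: summary stats of numeric columns must agree
        "numeric_stats_match": source.describe().round(6).equals(target.describe().round(6)),
        # Null counts per column must agree
        "null_counts_match": source.isna().sum().equals(target.isna().sum()),
    }
    # Sampling-based check: randomly sampled source rows must appear unchanged in the target.
    sample = source.sample(min(sample_size, len(source)), random_state=42)
    merged = sample.merge(target, on=list(source.columns), how="inner")
    checks["sample_rows_found"] = len(merged) == len(sample)
    return checks

# Hypothetical usage for one batch:
src = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
tgt = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
print(validate_batch(src, tgt))
```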