Data Engineer Interview
Data storage requirements
Depend on a variety of factors, including:
the type of data
the format of the data
the number of users accessing the data
the length of time the data will be retained
How much space is required to store 10 Million rows by 1000 columns in CSV? (Make assumptions if required)
Depends on 3 factors:
Data types
Integer data type: 4 bytes per value
Float data type: 8 bytes per value
Text data type: The space required will depend on the length of the text. As an estimate, we can assume an average of 50 bytes per value.
Datetime data type: 16 bytes per value
Compression
Indexing
Calculation: assuming a mix of types averaging 20 bytes per value, the total is 20 bytes * 1,000 columns * 10 million rows = 2 x 10^11 bytes, i.e. about 200 GB for CSV
If we store the data as Parquet and apply compression (a rough estimation sketch follows below):
Using Snappy compression: roughly 2-3x compression, hence about 65-100 GB
Using Gzip compression: roughly 10x compression, hence about 20 GB
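A minimal back-of-the-envelope sketch of the sizing above, using the bytes-per-value figure and compression ratios assumed in this answer (these are stated assumptions, not measured numbers):

```python
# Rough size estimate for a 10 million row x 1,000 column CSV.
ROWS = 10_000_000
COLS = 1_000
AVG_BYTES_PER_VALUE = 20  # assumed average over int/float/text/datetime values

raw_bytes = ROWS * COLS * AVG_BYTES_PER_VALUE
raw_gb = raw_bytes / 1e9  # decimal GB, good enough for an estimate

print(f"Raw CSV estimate: {raw_gb:,.0f} GB")                                # ~200 GB
print(f"Parquet + Snappy (2-3x): {raw_gb / 3:,.0f}-{raw_gb / 2:,.0f} GB")   # ~67-100 GB
print(f"Parquet + Gzip (~10x): {raw_gb / 10:,.0f} GB")                      # ~20 GB
```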
Storing 1000 rows by 10 Million columns versus 1000 columns by 10 Million rows: would there be any difference in the space required for storage?
It will be the same if no indexing is done while storing
How much space is required to store a 1 Million rows by 1000 columns CSV file as JSON?
The space required would be approximately 2x the CSV size
Keys, braces, newline characters, colons, commas, etc. contribute to the extra size
The assumption here is that the CSV file has dense data (a small size-comparison sketch follows)
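A small sketch, assuming pandas is available, that illustrates the roughly 2x inflation when the same dense table is written as JSON records instead of CSV; the table is scaled down so it runs quickly, since only the ratio matters:

```python
import numpy as np
import pandas as pd

# Hypothetical dense table: 10,000 rows x 10 numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1_000_000, size=(10_000, 10)),
                  columns=[f"col_{i}" for i in range(10)])

csv_bytes = len(df.to_csv(index=False).encode("utf-8"))
json_bytes = len(df.to_json(orient="records").encode("utf-8"))  # keys, braces, commas repeated per record

print(f"CSV size:  {csv_bytes:,} bytes")
print(f"JSON size: {json_bytes:,} bytes")
print(f"JSON/CSV ratio: {json_bytes / csv_bytes:.2f}x")
```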
How Compression works
Different compression techniques
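These two headings are left as notes; as a minimal illustration (assuming only the Python standard library), the sketch below compares a few general-purpose compression techniques on the same repetitive text, since repetition and structure are what compressors exploit:

```python
import bz2
import gzip
import lzma

# Repetitive, CSV-like sample data: highly compressible by design.
data = b"user_id,event,timestamp\n" + b"12345,click,2024-01-01T00:00:00\n" * 10_000

for name, compress in [("gzip", gzip.compress), ("bz2", bz2.compress), ("lzma", lzma.compress)]:
    compressed = compress(data)
    ratio = len(data) / len(compressed)
    print(f"{name:>4}: {len(data):,} -> {len(compressed):,} bytes ({ratio:.0f}x)")
```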
Data Quality Aspects
What are different aspects of data quality?
Accuracy: This refers to the correctness of data. Free from errors and reflects the true values.
Completeness: Extent to which all required data is available. Complete if all necessary fields are filled in and there are no missing values.
Consistency: Extent to which data is uniform and consistent across different systems, applications, or time periods. Consistent if it is free from contradictions or discrepancies.
Timeliness: Extent to which data is up-to-date and reflects the current state of affairs. Timely if it is available when it is needed and is not outdated.
Relevance: Extent to which data is useful and relevant to the intended purpose. Relevant if it is applicable to the problem at hand and is useful for making decisions or taking actions (simple programmatic checks for several of these dimensions are sketched after this list).
Validity: Extent to which data is valid and conforms to the rules and constraints defined for it. Valid if it meets the specified criteria and is consistent with the intended use.
Integrity: Extent to which data is complete, accurate, and consistent across different systems, applications, or time periods. Integral if it is protected against unauthorized changes or modifications.
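A minimal sketch, assuming pandas and a hypothetical orders table, of how a few of these dimensions translate into concrete checks (completeness, validity, uniqueness as a proxy for integrity, and timeliness):

```python
import pandas as pd

# Hypothetical orders data with a few deliberate quality problems.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "b@x", "c@x.com"],
    "amount": [10.0, -5.0, 20.0, 30.0],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2023-06-01", "2024-01-03"]),
})

# Completeness: share of non-missing values per column.
completeness = 1 - orders.isna().mean()

# Validity: values must conform to defined rules (non-negative amount, plausible email).
valid_amount_pct = (orders["amount"] >= 0).mean()
valid_email_pct = orders["email"].str.contains(r"@.+\.", na=False).mean()

# Integrity/uniqueness: order_id should not repeat.
duplicate_ids = orders["order_id"].duplicated().sum()

# Timeliness: staleness of the most recent update relative to a reference date.
staleness_days = (pd.Timestamp("2024-01-05") - orders["updated_at"].max()).days

print(completeness, valid_amount_pct, valid_email_pct, duplicate_ids, staleness_days, sep="\n")
```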
What are common data quality issues?
Inaccurate or incorrect data: Occurs when data is entered or recorded incorrectly, such as misspelled names, incorrect dates or times, or erroneous values.
Missing data: This occurs when data is not recorded or entered properly, or when certain data points are not available or are incomplete.
Duplicate data: This occurs when the same data is entered or recorded multiple times, leading to inconsistencies and confusion.
Inconsistent data: This occurs when data is entered or recorded in different formats or with different units of measurement, making it difficult to compare or analyze.
Outdated data: This occurs when data is not updated or maintained regularly, leading to inaccuracies and inconsistencies over time.
Data integration issues: This occurs when data from different sources is not properly integrated or aligned, leading to inconsistencies and errors.
Data security and privacy issues: This occurs when sensitive data is not properly secured or when privacy laws and regulations are not followed.
How to identify and resolve data quality issues?
Define data quality standards: Define the quality standards for the data, including accuracy, completeness, consistency, timeliness, relevance, validity, and integrity.
Conduct data profiling: Conduct a thorough analysis of the data to identify potential issues, such as missing values, duplicates, inconsistencies, and outliers (a small profiling sketch follows this list).
Identify the root cause of data quality issues: Analyze the data to determine the root cause of quality issues, such as incorrect data entry, system errors, or poor data integration.
Develop a data quality plan: Develop a plan to address the identified data quality issues, including strategies for data cleaning, data enhancement, and data governance.
Implement data quality processes: Implement processes to ensure ongoing data quality, such as data validation, data monitoring, and data auditing.
Use data quality tools: Use data quality tools and technologies to automate data profiling, data cleaning, and data validation.
Monitor and evaluate data quality: Monitor and evaluate data quality regularly to ensure that data meets the defined quality standards and to identify any new quality issues.
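A minimal data-profiling sketch, assuming pandas; the customers table and its columns are hypothetical. It produces the kind of per-column summary (nulls, distinct values, basic stats) referenced in the profiling step above:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column profile: dtype, nulls, distinct values, plus min/max/mean for numeric columns."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "null_pct": df.isna().mean().round(3),
        "distinct": df.nunique(),
    })
    numeric = df.select_dtypes("number")
    summary["min"] = numeric.min()
    summary["max"] = numeric.max()
    summary["mean"] = numeric.mean()
    return summary

# Hypothetical usage on a small customers table with a duplicate row and a missing value.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age": [34, 29, 29, None],
    "country": ["IN", "US", "US", "DE"],
})
print(profile(customers))
print("duplicate rows:", customers.duplicated().sum())
```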
Data validation or data quality check while migration
Instead of migrating all the data first and only then checking its quality, it is better to check data quality for each batch of data. The data volume is typically huge, so moving everything and then checking quality is not the right approach.
Approaches for testing through data contract for quality
Count based approach
Data profiling based approach: count, max, min, mean, variance, unique value, null values, distinct values
Data classification based approach
Data type
Randomly sampled migrated data tested against original source data (a batch-level comparison sketch follows)
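A minimal sketch of a batch-level source-vs-target check during migration, assuming both sides of one batch are loaded into pandas DataFrames; it combines the count, profiling, and random-sampling approaches listed above:

```python
import pandas as pd

def validate_batch(source: pd.DataFrame, target: pd.DataFrame, sample_size: int = 100) -> dict:
    """Compare one migrated batch against its source: counts, basic profile, and a random sample."""
    checks = {
        # Count-based check
        "row_count_match": len(source) == len(target),
        # Profiling-based check: summary stats of numeric columns must agree
        "numeric_stats_match": source.describe().round(6).equals(target.describe().round(6)),
        # Null counts per column must agree
        "null_counts_match": source.isna().sum().equals(target.isna().sum()),
    }
    # Sampling-based check: randomly sampled source rows must appear unchanged in the target.
    sample = source.sample(min(sample_size, len(source)), random_state=42)
    merged = sample.merge(target, on=list(source.columns), how="inner")
    checks["sample_rows_found"] = len(merged) == len(sample)
    return checks

# Hypothetical usage for one batch:
src = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
tgt = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
print(validate_batch(src, tgt))
```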