17 Data analysis and visualization (3)
Question 1
A retail company has collected sales data for the last year. The data includes information on product category, region, and sales revenue. You need to create a data visualization to identify the top 5 performing product categories by total sales revenue. Describe the steps you would take using a spreadsheet software (e.g., Microsoft Excel, Google Sheets) to achieve this, including the specific functions or features you would use. Include a description of the resulting chart type and why it is appropriate for this task.
To identify the top 5 performing product categories, I would follow these steps:
- Data Preparation: Ensure the data is in a tabular format with columns for 'Product Category' and 'Sales Revenue'.
- Pivot Table Creation: Select the entire dataset and create a pivot table. Drag 'Product Category' to the 'Rows' area and 'Sales Revenue' to the 'Values' area. Ensure 'Sales Revenue' is set to 'Sum'.
- Sorting: Sort the pivot table in descending order based on the 'Sum of Sales Revenue' column.
- Filtering: Apply a filter to the pivot table to display only the top 5 rows.
- Chart Creation: Select the data from the pivot table (specifically the 'Product Category' and 'Sum of Sales Revenue' columns). Create a bar chart. The 'Product Category' will be on the X-axis and 'Sum of Sales Revenue' will be on the Y-axis.
Chart Type Justification: A bar chart is appropriate because it allows for a clear visual comparison of the sales revenue across different product categories. The height of each bar directly represents the total sales revenue, making it easy to identify the top performers.
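The same pivot-table workflow can be sketched in pandas. This is a minimal illustration with made-up sales figures; the column names follow the schema described in the question.

```python
import pandas as pd

# Hypothetical sales data; column names follow the question's schema
sales = pd.DataFrame({
    "Product Category": ["Electronics", "Clothing", "Toys", "Food",
                         "Books", "Garden", "Electronics", "Food"],
    "Sales Revenue": [1200, 800, 450, 950, 300, 150, 700, 500],
})

# Equivalent of the pivot table: sum revenue per category,
# sort descending, and keep only the top 5 rows
top5 = (sales.groupby("Product Category")["Sales Revenue"]
             .sum()
             .sort_values(ascending=False)
             .head(5))
print(top5)
```

Plotting `top5` as a bar chart (e.g., `top5.plot(kind="bar")`) reproduces the chart described above, with categories on the X-axis and summed revenue on the Y-axis.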
Question 2
A university wants to analyze student enrollment data. The data includes information on student year of study (e.g., First Year, Second Year, Third Year), department, and number of students enrolled. Design a pivot table and chart solution to answer the following questions:
- What is the total number of students enrolled in each department?
- What is the distribution of students across different years of study within each department?
Clearly specify the pivot table layout and the type of chart you would use for each question. Provide a brief explanation of why your chosen chart type is suitable.
Part 1: Total Students per Department
Pivot Table Layout:
- Rows: Department
- Values: Sum of Number of Students Enrolled (the data records enrollment counts, so the aggregation must be 'Sum', not 'Count')
Chart Type: A bar chart is suitable. The department names will be on the X-axis and the number of students will be on the Y-axis. This allows for a direct comparison of enrollment numbers between departments.
Part 2: Distribution of Students by Year within Department
Pivot Table Layout:
- Rows: Department
- Columns: Year of Study
- Values: Sum of Number of Students Enrolled
Chart Type: A stacked bar chart is appropriate. Each bar will represent a department, and the segments within each bar will represent the number of students in each year of study. This visually shows the proportion of students in each year within each department.
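Both pivot table layouts can be sketched with `pandas.pivot_table`. The departments, years, and enrollment numbers below are invented purely for illustration.

```python
import pandas as pd

# Hypothetical enrollment records mirroring the question's fields
enrollment = pd.DataFrame({
    "Department": ["CS", "CS", "Math", "Math", "CS", "Math"],
    "Year of Study": ["First Year", "Second Year", "First Year",
                      "Second Year", "Third Year", "First Year"],
    "Students Enrolled": [120, 90, 60, 45, 70, 30],
})

# Part 1: total students per department (rows = Department)
totals = enrollment.pivot_table(index="Department",
                                values="Students Enrolled",
                                aggfunc="sum")

# Part 2: distribution by year within each department
# (rows = Department, columns = Year of Study)
by_year = enrollment.pivot_table(index="Department",
                                 columns="Year of Study",
                                 values="Students Enrolled",
                                 aggfunc="sum",
                                 fill_value=0)
print(totals)
print(by_year)
```

Calling `by_year.plot(kind="bar", stacked=True)` would produce the stacked bar chart described above, one bar per department with a segment per year of study.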
Question 3
A retail company has collected customer data including names, addresses, phone numbers, and purchase history. The data is inconsistent, with some names having titles (e.g., "Mr. John Doe"), addresses formatted differently, and phone numbers with varying formats (e.g., "123-456-7890", "(123) 456-7890"). Describe in detail the steps you would take to perform data cleaning on this dataset. Your answer should include specific techniques and tools that could be used. Consider the potential impact of data cleaning on subsequent analysis.
Data cleaning is a crucial step in preparing data for analysis. Here's a detailed breakdown of the steps I would take:
- Name Standardization:
- Title Removal: Use string manipulation functions (e.g., in Python with regular expressions or in SQL with REPLACE) to remove titles like "Mr.", "Ms.", "Dr." from the name field.
- Whitespace Removal: Remove leading and trailing whitespace from names.
- Case Consistency: Convert names to a consistent case (e.g., all uppercase or all lowercase).
- Name Splitting: Split names into first name, middle name (if present), and last name using appropriate delimiters (e.g., hyphens, spaces). Handle cases where middle names are missing.
Tools: Python (with libraries like re and pandas), SQL, Excel.
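The name-cleaning steps above can be sketched with Python's `re` module. The title list and the splitting rule (middle name = everything between first and last word) are simplifying assumptions; real names with suffixes or multi-word surnames would need extra handling.

```python
import re

# Assumed set of titles to strip; extend for the actual dataset
TITLE_RE = re.compile(r"^(Mr|Mrs|Ms|Dr|Prof)\.?\s+", re.IGNORECASE)

def clean_name(raw):
    """Strip a leading title, trim whitespace, normalize case,
    and split into first/middle/last parts."""
    name = TITLE_RE.sub("", raw.strip()).title()
    parts = name.split()
    return {
        "first": parts[0] if parts else "",
        "middle": " ".join(parts[1:-1]) if len(parts) > 2 else "",
        "last": parts[-1] if len(parts) > 1 else "",
    }

print(clean_name("  mr. JOHN doe "))
```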
- Address Standardization:
- Format Consistency: Use regular expressions to standardize address formats (e.g., converting all addresses to a specific format like "Street Address, City, State ZIP").
- Address Parsing: If possible, use address parsing libraries or APIs to break down addresses into components (street number, street name, city, state, ZIP code).
- Address Validation: Validate addresses against a postal address database to identify and correct errors.
Tools: Python (with libraries like addressparser), APIs (e.g., Google Maps API), specialized address validation software.
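A minimal regex-based sketch of the format-consistency step is shown below. The abbreviation table is a small assumed sample, not a complete standard; full parsing and validation would rely on the dedicated libraries or APIs mentioned above (note that a rule like "St" → "Street" would also mangle "St. Louis", which is why validation against a postal database matters).

```python
import re

# Assumed abbreviation expansions; a real mapping would be far larger
ABBREVIATIONS = {
    r"\bSt\b\.?": "Street",
    r"\bAve\b\.?": "Avenue",
    r"\bRd\b\.?": "Road",
}

def standardize_address(raw):
    """Collapse repeated whitespace and expand common street-type
    abbreviations so all addresses share one format."""
    addr = re.sub(r"\s+", " ", raw.strip())
    for pattern, full in ABBREVIATIONS.items():
        addr = re.sub(pattern, full, addr, flags=re.IGNORECASE)
    return addr

print(standardize_address("123  Main st.,  Springfield"))
```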
- Phone Number Standardization:
- Format Conversion: Use regular expressions to convert phone numbers to a consistent format (e.g., "123-456-7890").
- Area Code Validation: Validate area codes against a list of valid area codes.
- Remove Non-Numeric Characters: Remove any non-numeric characters (e.g., spaces, parentheses, dashes) from phone numbers.
Tools: Python (with regular expressions), SQL.
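The phone-number steps combine naturally into one function: strip non-numeric characters first, then reformat. This sketch assumes 10-digit numbers and returns None for anything else so those rows can be flagged for review; area-code validation against a real list is omitted.

```python
import re

def normalize_phone(raw):
    """Strip all non-digit characters, then reformat a 10-digit
    number as 123-456-7890; return None if it isn't 10 digits."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        return None  # flag for manual review rather than guessing
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

print(normalize_phone("(123) 456-7890"))
print(normalize_phone("123-456-7890"))  # both formats converge
```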
- Handling Missing Values:
- Identify Missing Values: Use functions to identify rows with missing values in key fields.
- Imputation: Consider imputing missing values using techniques like mean/median imputation (for numerical data) or mode imputation (for categorical data). Alternatively, consider removing rows with missing values if the percentage of missing data is low.
- Flagging: Create a new column to flag rows with imputed values.
Tools: Python (with pandas), SQL.
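The missing-value steps can be sketched in pandas with a small invented dataset. Note that the flag column is created before imputation, so analysts can later exclude or weight imputed rows.

```python
import pandas as pd

# Hypothetical customer fields with gaps
df = pd.DataFrame({
    "age": [34, None, 29, None, 41],                      # numerical
    "city": ["Leeds", "York", None, "Leeds", "Leeds"],    # categorical
})

# Identify missing values in key fields
print(df.isna().sum())

# Flag rows before filling, so imputed values remain traceable
df["age_imputed"] = df["age"].isna()

# Median imputation for numerical data, mode for categorical data
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```

Whether to impute or drop depends on how much data is missing; as noted below, imputation changes the distribution of the data and should be recorded, not applied silently.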
Impact on Subsequent Analysis: Data cleaning significantly improves the accuracy and reliability of subsequent analysis. Inconsistent data can lead to biased results and incorrect conclusions. Standardized data allows for more meaningful comparisons and accurate modeling. Careful consideration must be given to the potential impact of imputation techniques on the distribution of the data.