Guides: SPSS Statistical Software: Data Cleaning

Data Cleaning

For this part, we will begin using the Example dataset linked in this guide (both the Excel and CSV Example files have the same data, so either is fine to use) to show you how to perform some data cleaning procedures in SPSS. But let's start with a quick intro on what data cleaning is.

What is Data Cleaning?

Data cleaning is the process of preparing your data for analysis. It can include:

checking that the data is in the correct fields/columns
removing incomplete data cases
identifying and removing duplicate data or unneeded data columns
correcting any formatting issues or spelling errors
identifying and dealing with missing data
adding text labels to any numerically-coded nominal/categorical data
re-coding or transforming variables
if applicable, removing any personally identifying information from your dataset (e.g., names, email addresses, phone numbers).

To learn how you can conduct some simple data cleaning in SPSS, look through the examples below!

Adding Labels to Variables

After you import/open the Example dataset, click on the Variable View tab at the bottom of the dataset window.

Background Info

Remember, Labels are the more descriptive names we can give to variables, and Labels are what will appear in any analyses, tables, graphs, charts, or other outputs that you create with that variable. (If you do not add a Label, the variable Name is what will appear in your outputs).

Labels are helpful if your variable Names are not very descriptive or distinct (for example if all of your variables are named Q1, Q2, Q3, etc.), or if you want to add more information about what each variable is (for example, if you used a scale that measures happiness, you may have named your variables Hap_1, Hap_2, Hap_3, etc. for each individual question of the scale. You can add Labels to each variable with the specific text for each question of that scale to help you know which question is which).

If your variables are named in a similar manner to Q1, Q2, Q3, etc., before adding Labels, you should first edit the variable Names to be some sort of shorthand name that is more descriptive and unique for each variable so you know which variable is which. For example, if your Q1 variable was a consent form question, change the name from Q1 to consent. If Q2 asked for participants' ages, change Q2 to age.

Our example dataset already has descriptive shorthand names for our variables, so we can skip this step here. Ideally your variable Names should follow a similar format to the Names in the example dataset, so that you can easily tell what each variable is from its shorthand Name. Afterward, you can add Labels to any variables that need further clarification or detail.

Let's Practice adding Labels

Some of our variables in the example dataset do not actually need Labels; the variables Gender, Age, and Ethnicity (for example) are already very descriptive and clear with just their Names, so we do not need to add Labels to those. The output of any analyses we run with these variables will display these variables' Names since we are not adding Labels, and that is absolutely fine in this case because the Names are very clear about what these variables are.

We’ll start with Years_Employed and add a Label that doesn't include the underscore, just to clean up how this variable will display when it appears in our output analyses, tables, graphs, etc. (While this variable Name is technically very clear, it looks nicer to have a Label without the underscore that will then display in our outputs. As a reminder, Names cannot contain spaces; it is safest to stick to letters, numbers, and underscores. Labels can include any characters, including spaces.)

To add a Label, go to Variable View of your dataset.

Click inside the Label cell for a specific variable and type (or paste) in the information you want to include. (For Years_Employed, add a Label that says Years Employed).

The table below lists our example variables and the associated Labels we will be adding to them. See the images below the table for visual aids of what adding the Labels will look like.

Name	Label
Years_Employed	Years Employed
Work_Field	Work Field
Likert_Num	Likert Numbers
Time_Task_1	Time Task - Time 1
Time_Task_2	Time Task - Time 2
Hip_Hop	Hip Hop
Exp_Group	Experimental Group
Anx_1	Anxiety - Time 1
Depress_1	Depression - Time 1
Confid_1	Confidence - Time 1
Anx_2	Anxiety - Time 2
Depress_2	Depression - Time 2
Confid_2	Confidence - Time 2
Anx_3	Anxiety - Time 3
Depress_3	Depression - Time 3
Confid_3	Confidence - Time 3
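
For readers who also script their cleaning, the relationship between Names and Labels can be sketched in Python. This is only an illustration under assumptions, not how SPSS stores Labels: the variable_labels dictionary and the display_name helper are hypothetical names, and only a few rows of the table above are included.

```python
# Hypothetical sketch: map variable Names to a few of the Labels from the table above.
variable_labels = {
    "Years_Employed": "Years Employed",
    "Work_Field": "Work Field",
    "Exp_Group": "Experimental Group",
    "Anx_1": "Anxiety - Time 1",
}


def display_name(name, labels=variable_labels):
    """Return the Label for a variable, falling back to its Name if no Label was added."""
    return labels.get(name, name)


print(display_name("Years_Employed"))  # a labelled variable shows its Label
print(display_name("Age"))             # an unlabelled variable shows its Name
```

The fallback mirrors the behavior described earlier: when a variable has no Label, its Name is what appears in output.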

Adding Values to Variables

After you import/open the Example dataset, click on the Variable View tab at the bottom of the dataset window.

Background Info

Values are for specifying what each number means for numerically-coded categorical/nominal or ordinal variables. For example, if you collected data on participants' highest level of education, it may be reported in your dataset as 1's, 2's, 3's, 4's, etc. and not the actual words High School Diploma, Associate's Degree, Bachelor's Degree, etc. By adding Values to this education-level variable, you can tell SPSS what each numeric code means, and then for any analyses you run or tables/graphs/charts you make, SPSS will display those Values instead of the numeric codes.

Let's Practice adding Values

The variables in our example dataset that need Values are: Gender, Ethnicity, and Likert_Num (Likert Numbers).

To add Values, go to Variable View of your dataset.

Click inside the Values cell for a specific variable and then click the small button with 3 dots that appears on the right side of the cell. Let's do this for the Gender variable.
This will open up the Value editor window where you can tell SPSS what each numeric code means for a particular variable. Click on the Green Plus-Sign button to add input boxes for you to enter in each Value (numeric code) and Label (associated text label for that numeric code).
Type in the first Value and Label pair: Type 1 in the Value box and type Male in the Label box. Now click the Green Plus-Sign button to add another row of input boxes. In this second pair, type 2 in the Value box and type Female in the Label box.
Click OK to save these Value Labels. Now you have successfully added Values to the Gender variable.
Repeat these steps for the Ethnicity variable and the Likert_Num variable. There are tables below showing you what the numeric codes and associated Values are for both of these variables.

Ethnicity - Numeric Code	Values
1	American Indian or Alaska Native
2	Asian
3	Black or African American
4	Hispanic or Latino
5	Native Hawaiian or Pacific Islander
6	White

Likert_Num - Numeric Code	Values
1	Strongly Disagree
2	Disagree
3	Neither Agree nor Disagree
4	Agree
5	Strongly Agree
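
As a rough illustration of what Values do, the numeric codes can be thought of as keys in a lookup table. The Python sketch below is an assumption-laden analogy rather than SPSS's own mechanism: value_labels and label_for are hypothetical names, and the list of responses is made up.

```python
# Hypothetical sketch: value labels for the Gender and Likert_Num codes listed above.
value_labels = {
    "Gender": {1: "Male", 2: "Female"},
    "Likert_Num": {
        1: "Strongly Disagree",
        2: "Disagree",
        3: "Neither Agree nor Disagree",
        4: "Agree",
        5: "Strongly Agree",
    },
}


def label_for(variable, code, labels=value_labels):
    """Look up the text for a numeric code; return the code itself if it has no label."""
    return labels.get(variable, {}).get(code, code)


likert_codes = [5, 3, 1]  # made-up responses for illustration
likert_text = [label_for("Likert_Num", code) for code in likert_codes]
print(likert_text)
```

The stored data stays numeric; the labels only change what is displayed.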

Here are visual aids of what adding the Values to Ethnicity and Likert_Num will look like:

Now your Variable View screen should look like this:

Adjusting the Measurement Level of Variables

After you import/open the Example dataset, click on the Variable View tab at the bottom of the dataset window.

Background Info

Measurement Level (referred to as Measure in Variable View) is for specifying the level of measurement (nominal, ordinal, or scale) that each of your variables was collected as.

Let's Check our Variables' Measurement Levels for any Errors

When you import a dataset, SPSS tries to guess the measurement level for each of your variables. Sometimes SPSS specifies the measurement level incorrectly. You can manually change the Measurement Level (Measure) if you notice that it was incorrectly specified for any of your variables.

In the Example dataset, 4 of our variables have incorrectly specified Measurement Levels: Likert_Num, Depress_1, Depress_2, and Confid_3. They were all classified as Nominal when they should all be Scale.

Let's fix the Measurement Level of these Variables!

To adjust Measurement Level (Measure), go to Variable View of your dataset.

Click inside the Measure cell for a specific variable and then click the little dropdown arrow that appears on the right side of the cell. Do this for Depress_1.
Now select the appropriate measurement level. For Depress_1, select Scale.
Now we've changed the measurement level for Depress_1. Repeat these steps for Depress_2, Confid_3, and Likert_Num. (All of these should have Scale for their measurement level).

Once you've finished, your screen should look like this:

What if the Scale Measurement Level is not Appearing?

Let's say you have a variable that is numeric and the measurement level was incorrectly specified by SPSS as Nominal (or Ordinal). You go to change the Measurement Level but you notice that Scale is not an option - you only see Nominal and Ordinal. What do you do?

Look at the Type column (in Variable View) for the variable whose Measurement Level you want to change.
If the Type for that variable is listed as String, then SPSS does not allow you to select Scale as the Measurement Level (because String variables are text variables). You need to change the Type to be Numeric (or any of the appropriate numeric options based on what your variable is).
To change the Type, click in the Type cell for the variable, then click on the little button with 3 dots that appears on the right side of the cell. This opens up the Type selector window. Select Numeric (or the appropriate numeric option for the variable).
Click OK. Now you should see the updated Type for the variable.
Now you should be able to change this variable's Measurement Level to Scale.

Transforming Variables

After you import/open the Example dataset, stay in Data View.

Background Info

Transformations are used when you need to re-code any of your variables. Maybe there are some systematic typos across your data, or you want to change a scale/numeric variable into nominal (text) categories (like changing numeric ages into age categories).

Let's Practice Transforming Variables

The variables in our example dataset that need to be Transformed are: Work_Field, Time_Task_1, Time_Task_2, and we'll also go over how you could transform Age into age categories.

Let's walk through each of these Transformations in the sections below!

Transforming a String Variable into a String Variable

If you look at the column for the Work_Field variable, you'll notice that some of the responses are lowercase while the rest are uppercase. This can often happen if you have people type in their answers to a question on a survey. The issue is that SPSS is case sensitive, so it will consider the lowercase version of answers to be a completely different response than the uppercase versions (e.g., gov is considered a different answer than GOV even though they are actually meant to be the same answer, and we want them to be considered the same answer). Let’s fix Work_Field so all responses/answers are uppercase.

Transforming the Work_Field Variable

Go to Data View of your dataset.
Click on Transform on the top menubar.
Select Recode into Same Variables.
In the little pop-up window, select Work_Field and bring it over to the Variables box. Then click the Old and New Values button.
This opens up another pop-up window where you tell SPSS what the old (current) values are and the new values that you want to change them into. Now we will type in the lowercase version of each of the work field answers (gov, bus, educ, tech, health) into the Old Value box (one at a time) and then type in the corresponding uppercase version in the New Value box.
To start, type gov in the Old Value box and type GOV in the New Value box. Then click Add.
Repeat this for the rest of the work field answers: bus/BUS, educ/EDUC, tech/TECH, and health/HEALTH. (Be sure to click Add after each pair, including the last one you enter; otherwise SPSS will not include it!) Your screen should look like this:
Click Continue. Then click OK on the first pop-up.
Now if you look at the column of the Work_Field variable, you should see all answers are now uppercase and uniform in structure.

This is how you transform a string variable into another string variable. Another option is to use the Find and Replace feature. This is similar to a Find and Replace you could do in Microsoft Word or Excel. Here's how you can do a Find and Replace in SPSS.

Find and Replace Feature

In Data View, select the variable/column you want to do the Find and Replace for. (Click the variable name at the top of the column and it will highlight/select that column).
Now click the Binoculars icon in the toolbar above. This will open the Find and Replace window.
Click on the Replace tab. In the Find box, type in the response you want to find. Then in the Replace with box, type in what you want to change that response to. (For example, type gov in the Find box and type GOV in the Replace with box).
Click Replace All. This will replace every lowercase gov with uppercase GOV. (Repeat for the other responses you want to change.)
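
To see why the case difference matters and what the Work_Field recode accomplishes, here is a minimal Python sketch. It is an analogy under assumptions, not SPSS itself: the work_field list is made-up data, and recode_map is a hypothetical name standing in for the old/new pairs entered in the Old and New Values window.

```python
from collections import Counter

# Made-up Work_Field responses with inconsistent capitalization.
work_field = ["GOV", "gov", "BUS", "educ", "TECH", "health", "bus"]

# Explicit old -> new pairs, mirroring the Old and New Values window.
recode_map = {
    "gov": "GOV",
    "bus": "BUS",
    "educ": "EDUC",
    "tech": "TECH",
    "health": "HEALTH",
}

# Values not listed in the map are left unchanged.
cleaned = [recode_map.get(value, value) for value in work_field]

print(Counter(work_field))  # "gov" and "GOV" are counted as different answers
print(Counter(cleaned))     # after recoding they are counted together
```

Listing each pair explicitly, rather than uppercasing everything, keeps the sketch close to the dialog-based steps, where only the values you enter are changed.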

Transforming String Variables into Numeric Variables

For this example, we will be transforming the variables Time_Task_1 and Time_Task_2. Both of these variables are coded as string variables because they include text, but we want to change them to just be numbers. You can see that the responses are not even in uniform units; some are in hours, some are in minutes (min). With our transformation, we also want to put these variables in uniform units of just minutes for easier comparison.

Transforming Time_Task_1 and Time_Task_2

Go to Data View of your dataset.
Click on Transform on the top menubar.
Select Recode into Same Variables.
In the little pop-up window, click Reset to clear out the prior variable we were working with. Now select Time_Task_1 and Time_Task_2 and bring them over into the Variables box. Then click the Old and New Values button. (We can work with both Time_Task_1 and Time_Task_2 at one time because they have the same responses/answers).
This opens up another pop-up window where you tell SPSS what the old (current) values are and the new values that you want to change them into. Now we will type in the existing versions of each of the Time_Task answers (3 hours, 2 hours, 1.5 hours, 1 hour, 45 min, 30 min, 15 min) into the Old Value box (one at a time) and then type in the corresponding values in minutes (using just numbers, no text) in the New Value box.
To start, type 3 hours in the Old Value box and type 180 in the New Value box. Then click Add.

Do this for all of the Time Task answers:

Old Value	New Value
3 hours	180
2 hours	120
1.5 hours	90
1 hour	60
45 min	45
30 min	30
15 min	15

Be sure to click Add after each, including after the last one you enter, otherwise SPSS will not include it!
Your screen should look like this:

Click Continue. Then click OK on the first pop-up.
Now if you look at the columns for Time_Task_1 and Time_Task_2, you should see the transformed answers. However, upon closer inspection, we see that there was a typo in a few answers - 1 hours - which did not get transformed because we did not specify that exact spelling.
To fix this, click on Transform on the top menubar and then select Recode into Same Variables.
In the pop-up window, we will see that Time_Task_1 and Time_Task_2 are still selected, which is what we want. Click Old and New Values.
Now type in the misspelling 1 hours into the Old Value box and type 60 into the New Value box. Click Add.
Click Continue. Then click OK on the first pop-up. Now we will see the misspelled 1 hours answers have been replaced with 60.
While it now appears that Time_Task_1 and Time_Task_2 are numeric variables, they are actually still coded as string variables. We need to fix this. Go to Variable View.
For both Time_Task_1 and Time_Task_2, change their Type from String to Numeric. Click in the Type cell for Time_Task_1 and then click the little button with the 3 dots that appears on the right side of the cell. In the pop-up window, select Numeric, and then click OK. Repeat this for Time_Task_2.
Now we need to change the Measurement Level (Measure) for Time_Task_1 and Time_Task_2. They are currently both Nominal. Click in the Measure cell for Time_Task_1, then click the little drop down arrow, and select Scale. Repeat this for Time_Task_2.
Now you have successfully changed Time_Task_1 and Time_Task_2 into numeric, scale variables. Your Variable View screen should look like this now:

Changing Type to Numeric and Measure to Scale is very important because if you forget to do these steps, you will not be able to run analyses for numeric variables on Time_Task_1 and Time_Task_2 - you would only be able to run analyses for String, Nominal variables.
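
The Time_Task recode, including the 1 hours typo, can be sketched in Python as a lookup plus a check for responses that have no mapping. This is an illustrative sketch under assumptions: to_minutes, time_task_1, and unmapped are hypothetical names, and the sample responses are made up.

```python
# Old -> new pairs from the table above, plus the "1 hours" typo found afterwards.
to_minutes = {
    "3 hours": 180,
    "2 hours": 120,
    "1.5 hours": 90,
    "1 hour": 60,
    "1 hours": 60,
    "45 min": 45,
    "30 min": 30,
    "15 min": 15,
}

# Made-up Time_Task_1 responses for illustration.
time_task_1 = ["3 hours", "45 min", "1 hours", "1.5 hours"]

# Collect any response that has no mapping so typos are caught before converting.
unmapped = sorted({value for value in time_task_1 if value not in to_minutes})

# Convert only when every response is accounted for.
minutes = [to_minutes[value] for value in time_task_1] if not unmapped else None

print(unmapped)
print(minutes)
```

Checking for unmapped responses first is the scripted equivalent of inspecting the column after the recode, which is how the 1 hours typo was spotted above.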

Transforming Numeric Variables into String Variables (e.g., Categories)

What if you have a numeric variable that you want to collapse into nominal categories? You can use the Transform function to do that. We'll practice this by using the Age variable.

Transforming the Age Variable into Age Categories

Click on Transform on the top menubar.
Select Recode into Different Variables.
This brings up a little pop-up window. Select Age and bring it over into the Variables box. Then, because we are creating a new variable, we need to give this new variable (i.e., the Output Variable) a Name and Label. Click inside the Name box and type Age_Cat. Click inside the Label box and type Age Category. Click Change.
This creates the new variable, but now we need to tell SPSS what values this new variable will consist of. Click the Old and New Values button.
In the new pop-up window that appears, we will use the Range options to enter in our Old Values because we are converting scale, numeric data (based on specific numeric ranges) into nominal categories. To start, under Old Value, select Range, LOWEST through value. Then type 24 in the box. Check the box labelled Output variables are strings, then type Age 18-24 in the New Value box. Then click Add.
(Note: The reason we chose Range, LOWEST through value is because we want all numbers up to 24 to be included in this category. So in terms of age, this would include 0 years old up to 24 years old. However, the youngest person in this example dataset is 18 years old, so that's why we are choosing to label this category Age 18-24. If you had a variable that included negative numbers, Range, LOWEST through value would include those negative numbers up to whatever number you enter in the box).
Under Old Value, now select just Range. You'll see there are 2 boxes for you to enter the beginning number and the final number of your range. Type 25 in the top box and type 34 in the bottom box. In the New Value box, type Age 25-34. Then click Add.
(The result of this is that, in the new Age_Cat variable we are creating, anyone aged 25 to 34 will be put into the category Age 25-34).
Repeat the previous step (select Range, enter the lower and upper bounds, and type the matching New Value) for the following age ranges: 35-44, 45-54, and 55-64.
Now we need to add our final category for those 65 years old or older. For this step, under Old Value, select Range, value through HIGHEST. Then type 65 in the box. Then type Age 65+ in the New Value box. Then click Add.
(Note: The reason we chose Range, value through HIGHEST is because we want all values that are 65 or greater to be included in this category. So in terms of age, this would include 65 years old up to whatever the highest age in our dataset is - in this case, it's 67).
One last thing: we need to change the Width of our soon-to-be-created variable (the Width box sits beside the Output variables are strings checkbox). By default, the Width is set to 8, but 8 characters is not enough to fit all of our new value names. For example, Age 18-24 is 9 characters (including the space). So that SPSS does not truncate our new value names, the Width must be at least as large as the longest value name. Let's go with 10 to be safe. Now your screen should look like the image below. Click Continue.
Click OK on the first pop-up window, and now SPSS will create your new variable. You can verify that it was created successfully by looking at the Data View of your dataset and scrolling all the way to the right to see it.

Note: If you forgot to adjust the Width, you may notice that some of your category names were cut off, as shown below. If this happened, don't worry; just follow the next two steps to fix it.

Go to Variable View. Scroll down to the Age_Cat variable at the bottom. Click in the Width cell for Age_Cat, and use the arrow buttons that appear to increase the number to 10. (The number just needs to be equal to or higher than the maximum number of characters in the names of your categories, so let's use 10 to be safe.) If you go back to Data View, though, you'll notice this alone didn't fix the issue; there's one more step.
The final step is to re-run the Transformation. Click on Transform on the top menubar, then select Recode into Different Variables. All of the information we entered in before should still be there, so just click OK to re-run this Transformation. Now if you look at this variable in Data View, you should see the full names of each category because we increased the Width in the prior step.
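
The age ranges and the Width requirement can also be sketched in Python. This is a minimal illustration under assumptions, not the SPSS procedure: age_category is a hypothetical helper, the ages list is made up, and whole-number ages are assumed.

```python
def age_category(age):
    """Map a whole-number age to the category names used in the steps above."""
    if age <= 24:
        return "Age 18-24"  # LOWEST through 24
    if age >= 65:
        return "Age 65+"    # 65 through HIGHEST
    # The middle ranges are 25-34, 35-44, 45-54, and 55-64.
    for low in (25, 35, 45, 55):
        if low <= age <= low + 9:
            return f"Age {low}-{low + 9}"
    raise ValueError(f"age {age} falls between the defined ranges")


ages = [18, 24, 25, 44, 64, 67]  # made-up ages for illustration
categories = [age_category(age) for age in ages]
longest = max(len(category) for category in categories)

print(categories)
print(longest)  # the string Width must be at least this many characters
```

Computing the longest category name up front shows why the default Width of 8 is too small here: Age 18-24 needs 9 characters.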
