[Tensorflow 2.0] Load CSV to tensorflow

A Ydobon
15 min read · Sep 20, 2019


Some parts are revised and elaborated for better understanding; however, I hereby acknowledge that the following post is based on the TensorFlow tutorial provided in:

For more detailed explanations and background knowledge regarding the code and the dataset, you can always consult the link.

Side note) GOOD NEWS! Our last post, on saving models, was published in Analytics Vidhya!!! We hope our posts can reach more people who are interested in programming and TensorFlow. Thank you for your interest; it means a lot to us :)

I. Introduction

What is Loading Data?

Loading data is the very beginning of any coding project. You cannot make a salad without tomatoes, lettuce, salami, or whatever ingredients you like. Likewise, we need ingredients to train and predict. The process of bringing the ingredients onto your cutting board is called 'loading data'.

For example, the MNIST dataset (the data we used in the save-and-reload-model post) comes ready-made for practice, so loading it was not much of a burden. In real life, however, we will face datasets stored in all kinds of formats, and we will definitely have to clean up and process the data before any further steps such as training and testing.

If you want to make a legit salad, you should prepare your own veggies rather than buy a pre-packaged one. Now let's get out of our comfort zone and try loading data that was not designed for a Python or TensorFlow environment.

What is a CSV File?

A CSV (comma-separated values) file is a plain-text file in which each line is a data record and the values within a record are separated by commas. Excel can open and save CSV files, which is why they are often confused with Excel files, but a CSV is just text, and that is exactly what makes it easy to load and process in almost any programming environment.
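For illustration, here is what a tiny CSV file might look like if you opened it in a plain-text editor (a made-up example, not our dataset):

name,age,city
Alice,34,Southampton
Bob,28,Cherbourg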

Why do we want to Load a ‘CSV data’?

CSV is one of the most widely used formats in the universe for saving data. Whatever country you are in, it is highly probable that governments and firms publish data in CSV format. (Although people say it is the era of Python, Bill Gates is still out there raking in money$$.) So it will be very useful to learn how to load data from a CSV file.

Also, one of the most interesting advances in TensorFlow 2.0 is that it can at last load and process data directly from files in diverse formats.

If TensorFlow has made such progress, we want to keep up with the change!

If you have loaded CSV files in previous TensorFlow versions, the upcoming process might seem unnecessary and boring. The following is code written on kaggle.com, and the pandas read_csv approach does look easier than what we are about to do.

import pandas as pd

df = pd.read_csv('../input/iwildcam-2019-fgvc6/train.csv')

fnames = df['file_name']
cat_ids = df['category_id']  # avoid shadowing the built-in `id`

all_image_paths = ['../input/iwildcam-2019-fgvc6/train_images/' + fname for fname in fnames]
all_image_labels = list(cat_ids)

# `img_size` (the number of examples to keep) is assumed to be defined earlier in that notebook.
paths_labels = dict(zip(all_image_paths[:img_size], all_image_labels[:img_size]))

However, with this newly updated tutorial we can now load CSV data directly from a file (not through pandas) into a tf.data.Dataset. A basic intention of TensorFlow is to convert any data format into a dataset to facilitate modeling. So our focus is not just on reading the CSV file but on turning it into a dataset.
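To get a feel for what a tf.data.Dataset is before we build one from a CSV file, here is a minimal sketch that turns a plain Python list into a dataset and iterates over it in batches (the numbers are made up):

import tensorflow as tf

# A tf.data.Dataset is an iterable pipeline of elements (here, single numbers).
toy_dataset = tf.data.Dataset.from_tensor_slices([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Datasets can be transformed, e.g. grouped into batches, before training.
for batch in toy_dataset.batch(2):
    print(batch.numpy())  # [1. 2.], then [3. 4.], then [5. 6.]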

Then, let’s begin our journey!

II. Set up

  1. Technical setup
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

If you want to read more about try/except or %magic functions, please consult the following post;

from __future__ import absolute_import, division, print_function, unicode_literals
import functools

import numpy as np
import tensorflow as tf

If you want to read more about from __future__ import absolute_import, division, print_function, unicode_literals, consult the link;

Sorry that they are only available in Korean for now; we will probably translate them sooner or later!

2. Take a Look at our Dataset

The training dataset contains a list of 627 Titanic passengers together with their information (sex, age, ticket class, etc.).

training data

For your information, here is some clarification on the column labels that might confuse you.

  • Survived: Survived (1) or died (0)
  • n_siblings_spouses: Number of siblings or spouses that a passenger boarded with
  • parch: Number of parents or children that a passenger boarded with.
  • embark_town: the port where a passenger embarked (it might hint at where the passenger lived, and therefore how wealthy they were).

And in the end we are going to predict how probable it is for a passenger with certain characteristics to survive in a catastrophe.

Side note) It feels a little creepy that we can predict, with a certain probability, whether or not a person would survive a sinking ship. And this model actually does predict survival with some confidence.

*Spoiler Alert* Who survived in the movie Titanic, Leonardo DiCaprio or Kate Winslet? What features might have affected her survival other than 'the love of Jack'? She might have survived regardless of Jack's sacrifice; maybe it would even have been easier for her to get a seat in the lifeboat had there been no Jack. :(

last moment of Jack and Rose in the movie Titanic

3. Download Data from a URL

TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)
# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

III. Load Data

Although I actually downloaded the CSV file and provided a screenshot above, it is sometimes impossible, or not recommended, to download the actual CSV file onto your laptop.

In that case we can see the top of the CSV file in the following way.

!head {train_file_path}

The result is the same as what the screenshot of train.csv shows.
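Since the screenshot does not appear in this text version, here is the kind of output to expect: the first line is the header row with the ten column names (the same ones we list in CSV_COLUMNS below), followed by one comma-separated row per passenger (the passenger row here is illustrative):

survived,sex,age,n_siblings_spouses,parch,fare,class,deck,embark_town,alone
0,male,22.0,1,0,7.25,Third,unknown,Southampton,n
...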

For convenience, we store the name of the 'survived' column in LABEL_COLUMN and its possible values, 0 and 1, in LABELS. Since this is the value we are going to predict with this model, we will (with 100% certainty) revisit it.

LABEL_COLUMN = 'survived'
LABELS = [0, 1]

NOW we are going to read the CSV data from the file and create our dataset to play with.

def get_dataset(file_path, **kwargs):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=5,  # Artificially small to make examples easier to show.
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True,
      **kwargs)
  return dataset
  • We define the get_dataset function for future use. It takes file_path as an input (we have two file paths: train_file_path and test_file_path).
  • Also, **kwargs is your insurance (it stands for keyword arguments): you might want to add some arguments later on, so show your future self some mercy.
    We actually will be covered by this insurance later, for example:
    temp_dataset = get_dataset(train_file_path, column_names=CSV_COLUMNS)
    (A minimal sketch of how keyword-argument forwarding works follows right after this list.)
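If **kwargs is new to you, here is a minimal, self-contained sketch (toy function names, not part of the tutorial) of how extra keyword arguments are collected and forwarded to an inner function:

def inner(file_path, column_names=None, select_columns=None):
    print(file_path, column_names, select_columns)

def outer(file_path, **kwargs):
    # Extra keyword arguments are collected into the dict `kwargs`
    # and passed straight through to `inner`.
    return inner(file_path, **kwargs)

outer("train.csv")                                # no extras
outer("train.csv", column_names=["a", "b", "c"])  # forwarded to inner()

This mirrors how get_dataset(train_file_path, column_names=CSV_COLUMNS) forwards column_names on to make_csv_dataset.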

Let's take a deeper look at how tf.data.experimental.make_csv_dataset works and what the arguments mean. Please consult the link below if you are interested. (I will only cover the arguments that are used in our code.)
https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/experimental/make_csv_dataset

**Caution** it lives in the experimental module, so it may well be changed or updated by TensorFlow.

1. Basic mechanism.

tf.data.experimental.make_csv_dataset reads the CSV file into a dataset of batches, where each batch is a (features, labels) tuple and each example in the batch corresponds to a row of the CSV file. The 'features' part is a dictionary that maps column names to 'tensors' containing the corresponding feature data, and 'labels' is also a tensor containing the batch's label data.

The paragraph above is paraphrased from the documentation linked. The part that a newcomer will (with 99% probability) not understand is "TENSOR". The name itself does not give you any intuition.

(image from muggle.net)

I think of a 'tensor' as Hermione's beaded bag.
Are there any Harry Potter fans out there? If you are not one, you'd better catch up on that before any TensorFlow :)
Remember that Hermione used an extension charm on a small handbag? She carries the bag with her throughout Harry Potter and the Deathly Hallows, and it holds everything that might be needed during their journey, from a tent to medicine.
Similar to Hermione's magical bag, a tensor can contain data of any size, and the number of dimensions the data has is called its "rank":

  • if a tensor contains a scalar (a single constant number), rank=0
    Why would you need an extension charm if you are only going to carry a lipstick in your handbag like a muggle?
  • for a single row or column vector, rank=1
  • for an n-by-m matrix (two dimensions), rank=2
  • and 3-dimensional data has rank 3, and so on.

So you can keep data of any size, because a tensor can be extended accordingly, and you can always call your data out of a tensor with the Accio spell (the summoning spell). In the novels, Accio is used to pull something out of the magic bag, and we use the same mechanism. Isn't it magical?
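Here is a small sketch that builds a tensor of each rank with tf.constant and checks it with tf.rank (the numbers are arbitrary):

import tensorflow as tf

scalar = tf.constant(3.0)                       # rank 0: a single number
vector = tf.constant([1.0, 2.0, 3.0])           # rank 1: a row of numbers
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])  # rank 2: an n-by-m grid
cube = tf.zeros([2, 3, 4])                      # rank 3: 3-dimensional data

for t in (scalar, vector, matrix, cube):
    print(tf.rank(t).numpy())  # prints 0, 1, 2, 3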

2. The following arguments

  • file_path: a list of file paths containing CSV records
  • batch_size: an integer representing the number of records to combine in a single batch. So, one batch will contain 5 CSV rows.
  • label_name: the name of the column to treat as the label. We pass LABEL_COLUMN, i.e. 'survived', the value we want to predict.
  • na_value: an additional string to recognize as NA/NaN; here, any '?' in the file is treated as a missing value.
  • num_epochs: an int specifying the number of times this dataset is repeated; with num_epochs=1, each row is read once per pass over the data.

Then we can actually use the function to read the files and create our datasets.

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)
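Before writing a pretty-printer, we can peek at a single batch to confirm the (features, labels) structure described above. This is just a quick sketch; the exact values depend on which batch you happen to draw.

features, labels = next(iter(raw_train_data))
print(type(features))  # a dict-like mapping of column name -> tensor of shape (5,)
print(labels.numpy())  # e.g. [0 1 0 0 1], the 'survived' values for this batch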

We want to see explicitly what it looks like, so we define a show_batch function and then look at the returned raw_train_data.

def show_batch(dataset):
  for batch, label in dataset.take(1):
    for key, value in batch.items():
      print("{:20s}: {}".format(key, value.numpy()))

show_batch(raw_train_data)

sex : [b'female' b'female' b'male' b'female' b'male']
age : [51. 28. 31. 28. 43.]
n_siblings_spouses : [1 0 0 1 0]
parch : [0 0 0 0 0]
fare : [77.958 79.2 7.775 15.5 8.05 ]
class : [b'First' b'First' b'Third' b'Third' b'Third']
deck : [b'D' b'unknown' b'unknown' b'unknown' b'unknown']
embark_town : [b'Southampton' b'Cherbourg' b'Southampton' b'Queenstown' b'Southampton']
alone : [b'n' b'y' b'y' b'n' b'y']

Here you can see that all the column names are automatically read by TensorFlow from the first row of the file. However, there are some other possibilities.

  1. If there are no column names in your file, or they are not in the first row, you can create a list of column names and pass it to the column_names argument of tf.data.experimental.make_csv_dataset (here, through our own get_dataset function).
CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']

temp_dataset = get_dataset(train_file_path, column_names=CSV_COLUMNS)

show_batch(temp_dataset)

sex : [b'male' b'male' b'female' b'male' b'male']
age : [26. 34. 28. 21. 45.]
n_siblings_spouses : [0 1 2 0 0] …

alone : [b'y' b'n' b'n' b'n' b'y']

2. If you only want to use certain columns, follow a similar process: create a list of the column names you are interested in and pass it to the select_columns argument.

SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'class', 'deck', 'alone']

temp_dataset = get_dataset(train_file_path, select_columns=SELECT_COLUMNS)

show_batch(temp_dataset)

age : [28. 28. 25. 25. 19.]
n_siblings_spouses : [0 0 1 0 0]
class : [b'Third' b'Third' b'First' b'Third' b'Third']
deck : [b'unknown' b'unknown' b'B' b'F' b'unknown']
alone : [b'y' b'n' b'n' b'y' b'y']

IV. Data Preprocessing

Up until now we have loaded the CSV file and saved it into tensors.
We opened our refrigerator, picked some tomatoes and other ingredients, and put them on our cutting board (well, maybe not quite; you've got to wash them first). What lies ahead of us is chopping them up into equal sizes. In data science this chopping process is called data preprocessing.

Basically what we are going to do is

  1. Divide the data into continuous and categorical types.
  2. Convert the data into fixed-length vectors for better modeling.
  3. Combine the two kinds of columns for modeling.

A. Continuous Data

i) We have 4 numeric feature columns plus the 'survived' label, so we select only the columns 'survived', 'age', 'n_siblings_spouses', 'parch', and 'fare'.

SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'parch', 'fare']
DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]
temp_dataset = get_dataset(train_file_path,
                           select_columns=SELECT_COLUMNS,
                           column_defaults=DEFAULTS)

show_batch(temp_dataset)

age : [24. 18. 60. 28. 28.]
n_siblings_spouses : [1. 1. 1. 0. 0.]
parch : [2. 0. 1. 0. 0.]
fare : [65. 17.8 79.2 7.75 26.55]

ii) Then we will pack all the columns together with the tf.stack function. We will be 'packing' several times, so we define the process as a function, pack(features, label).

def pack(features, label):
  return tf.stack(list(features.values()), axis=-1), label
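To see what tf.stack with axis=-1 does here, consider this tiny sketch with two made-up 'columns' of three values each:

a = tf.constant([1.0, 2.0, 3.0])  # e.g. an 'age'-like column
b = tf.constant([4.0, 5.0, 6.0])  # e.g. a 'fare'-like column

stacked = tf.stack([a, b], axis=-1)
print(stacked.numpy())
# [[1. 4.]
#  [2. 5.]
#  [3. 6.]]  -> one row per passenger, one column per feature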

Then we apply pack() to each element of the dataset with map().

packed_dataset = temp_dataset.map(pack)

for features, labels in packed_dataset.take(1):
  print(features.numpy())
  print()
  print(labels.numpy())

[[24.    0.    0.   13.  ]
 [65.    0.    0.   26.55]
 [22.    0.    0.    7.25]
 [37.    1.    0.   26.  ]
 [60.    1.    1.   79.2 ]]

[0 0 0 0 1] <- labels (when we defined pack() we made it return the labels as well)

Compare the result with the earlier per-column view:
age : [24. 18. 60. 28. 28.]
n_siblings_spouses : [1. 1. 1. 0. 0.]
parch : [2. 0. 1. 0. 0.]
fare : [65. 17.8 79.2 7.75 26.55]

We were able to arrange the numeric data into a list of 5 sublists, where each sublist is the information of one person. For example, the first person, [24. 0. 0. 13. ], has label 0: a 24-year-old who boarded alone (no siblings, spouses, parents, or children) and paid a fare of 13 did not survive (label: 0) :(
We can do the same analysis for the other passengers. Let's look at what characteristics the only surviving passenger in our batch has: [60. 1. 1. 79.2 ]. He or she was 60 years old, boarded with one sibling or spouse and one parent or child, paid a fare of 79.2, and survived!

iii) Since we have both continuous and categorical data types, we want to separate the numeric data from the rest and treat the packed matrix above as a single column. We will use a preprocessor that selects the numeric columns and packs them into a single 'numeric' column.

class PackNumericFeatures(object):
  def __init__(self, names):
    self.names = names

  def __call__(self, features, labels):
    numeric_features = [features.pop(name) for name in self.names]
    numeric_features = [tf.cast(feat, tf.float32) for feat in numeric_features]
    numeric_features = tf.stack(numeric_features, axis=-1)
    features['numeric'] = numeric_features
    return features, labels

NUMERIC_FEATURES = ['age', 'n_siblings_spouses', 'parch', 'fare']

packed_train_data = raw_train_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))

packed_test_data = raw_test_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))

show_batch(packed_train_data)

# Keep one example batch around; we will use it below to inspect the
# preprocessing layers. Note it is taken from the *packed* data so that it
# contains the combined 'numeric' column.
example_batch, labels_batch = next(iter(packed_train_data))

sex : [b'male' b'female' b'male' b'male' b'male']
class : [b'Third' b'Third' b'First' b'Third' b'First']
deck : [b'unknown' b'unknown' b'C' b'unknown' b'C']
embark_town : [b'Southampton' b'Southampton' b'Southampton' b'Southampton' b'Cherbourg']
alone : [b'n' b'n' b'n' b'y' b'n']
numeric : [[ 1. 1. 2. 20.575] [ 41. 0. 2. 20.212] [ 64. 1. 4. 263. ] [ 18. 0. 0. 7.75 ] [ 49. 1. 0. 89.104]]

iv) Last but not least :), we have to normalize the continuous data.

'age' ranges from 0.75 to 80 in our dataset, 'n_siblings_spouses' from 0 to 8, 'parch' from 0 to 5, and 'fare' from 0 to 512.

Here we do NOT assume that our data follows a normal distribution. What we want to do is 'normalize' the data. Since our numeric columns live on very different scales, we put them on a common scale by subtracting mu (the mean) and dividing by sigma (the standard deviation). The result is data with mean 0 and standard deviation 1. Basically it is the same recipe as standardizing a normal distribution, except that we do not assume normality. (Being free of the normal-distribution assumption is part of the beauty of machine learning.)
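Here is a tiny numeric sketch of that recipe, with made-up numbers rather than the actual Titanic statistics:

import numpy as np

ages = np.array([24.0, 30.0, 42.0])
mean, std = ages.mean(), ages.std()  # 32.0 and roughly 7.48 for these toy values
normalized = (ages - mean) / std
print(normalized)  # centred on 0, roughly [-1.07, -0.27, 1.34]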

import pandas as pd

desc = pd.read_csv(train_file_path)[NUMERIC_FEATURES].describe()
desc

MEAN = np.array(desc.T['mean'])
STD = np.array(desc.T['std'])

def normalize_numeric_data(data, mean, std):
  # Center the data
  return (data - mean) / std

Now we want to create a normalized numeric column.

First, we use functools.partial to bind MEAN and STD to the normalize_numeric_data() function we just defined.

normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)
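If functools.partial is new to you, here is a minimal sketch (a toy function, not part of the tutorial) of what 'binding' arguments means:

import functools

def scale(x, factor):
    return x * factor

double = functools.partial(scale, factor=2)  # `factor` is now fixed to 2
print(double(5))  # 10; only `x` still needs to be supplied

This is exactly why normalizer can later be called with a single data argument: MEAN and STD are already baked in, which matches the one-argument signature that normalizer_fn expects.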

Now, using tf.feature_column.numeric_column, we attach the normalizer function so that it is applied to the 'numeric' column of each batch, i.e., to matrices like [[ 1. 1. 2. 20.575] [ 41. 0. 2. 20.212] [ 64. 1. 4. 263. ] [ 18. 0. 0. 7.75 ] [ 49. 1. 0. 89.104]].

numeric_column = tf.feature_column.numeric_column('numeric', normalizer_fn=normalizer, shape=[len(NUMERIC_FEATURES)])
numeric_columns = [numeric_column]
numeric_column

We can check what we did in iv) by comparing the results of the following lines.

example_batch['numeric']

array([[ 21.   ,   2.   ,   2.   , 262.375],
       [ 39.   ,   0.   ,   0.   ,  26.   ],
       [ 28.   ,   0.   ,   0.   ,   7.729],
       [ 37.   ,   2.   ,   0.   ,   7.925],
       [ 20.   ,   0.   ,   0.   ,   9.225]], dtype=float32)

numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)
numeric_layer(example_batch).numpy()

array([[-0.69 , 1.264, 2.043, 4.176],
[ 0.749, -0.474, -0.479, -0.154],
[-0.13 , -0.474, -0.479, -0.488],
[ 0.589, 1.264, -0.479, -0.485],
[-0.77 , -0.474, -0.479, -0.461]], dtype=float32)

A statistical caveat here is that this normalization requires knowing the mean and standard deviation of each column beforehand, which is why we computed MEAN and STD from the training data first.

B. Categorical Data

We have 5 columns left: 'sex', 'class', 'deck', 'embark_town', and 'alone', and these are categorical columns. In CATEGORIES we pair each categorical variable's name with its possible values, e.g., 'sex': ['male', 'female'].

CATEGORIES = {
    'sex': ['male', 'female'],
    'class': ['First', 'Second', 'Third'],
    'deck': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town': ['Cherbourg', 'Southampton', 'Queenstown'],
    'alone': ['y', 'n']
}

Now we will be using tf.feature_column.indicator_column for each categorical column.

categorical_columns = []
for feature, vocab in CATEGORIES.items():
  cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
      key=feature, vocabulary_list=vocab)
  categorical_columns.append(tf.feature_column.indicator_column(cat_col))

tf.feature_column.categorical_column_with_vocabulary_list() creates a CategoricalColumn with an in-memory vocabulary. 'In-memory' simply means the list of possible values is passed in directly and kept in memory, as opposed to being read from a vocabulary file (there is a sister function, categorical_column_with_vocabulary_file, for that case). Wrapping it in indicator_column then turns each value into a one-hot vector.
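Here is a small sketch of what the indicator (one-hot) encoding does to a single toy column (the values are made up):

sex_col = tf.feature_column.categorical_column_with_vocabulary_list(
    key='sex', vocabulary_list=['male', 'female'])
sex_one_hot = tf.feature_column.indicator_column(sex_col)

layer = tf.keras.layers.DenseFeatures([sex_one_hot])
print(layer({'sex': tf.constant(['male', 'female', 'female'])}).numpy())
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]]  -> one column per vocabulary entry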

# See what you just created.
categorical_columns
[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southampton', 'Queenstown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]

Then we do the same thing we did in the last line of the numeric case.

Previously,

numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)
numeric_layer(example_batch).numpy()

In categorical data case,

categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)
print(categorical_layer(example_batch).numpy()[0])

the result is:

[0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1.]

C. Combined Preprocessing Layer

We have cut all the ingredients into equal sizes; now we are going to put them into one bowl and toss the salad. The combined vector below has 24 entries: the 20 one-hot entries from the categorical columns plus the 4 normalized numeric values.

preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns + numeric_columns)

print(preprocessing_layer(example_batch).numpy()[0])

[ 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. -0.69 1.264 2.043 4.176 0. 1. ]

Side note) Types of data/variables. You might have learned this in your Stat 101 course, but it is helpful to review.
Data is divided into numerical (quantitative) and categorical (qualitative) data. Numerical data is further divided into discrete data (countable, e.g., the number of heads in a series of coin flips, or the number of people visiting a store) and continuous data (not countable, e.g., height in cm). Categorical data is divided into ordinal data (an order exists, e.g., letter grades) and nominal data (no order, e.g., gender).

V. Build the Model

An obvious spoiler of this post is that we are almost at the end. It may feel unfair that we have not even started modeling, training, and testing, which are the sweetest moments of programming. Cooking is the same! Grocery shopping, washing, and chopping are very time-consuming, but once you light the fire, cooking goes fast and stays dynamic. And without proper preparation, there is only a slim chance that your food turns out amazing.

So stay focused! We are about to start stirring our preprocessing_layer into the model and will finish very soon!

model = tf.keras.Sequential([
  preprocessing_layer,
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])

Basically, everything we did from Section II through Section IV was building the preprocessing_layer. People say that how well you clean up and select the necessary data is the key to modeling.

VI. Train, Evaluate, and Predict

  1. Train
train_data = packed_train_data.shuffle(500)
test_data = packed_test_data
model.fit(train_data, epochs=20)

2. Check its accuracy on the test_data set.

test_loss, test_accuracy = model.evaluate(test_data)

print('\n\nTest Loss {}, Test Accuracy {}'.format(test_loss, test_accuracy))

Test Loss 0.47088844026878196, Test Accuracy 0.8068181872367859

3. Predict: we predict the probability of survival given a passenger's information.

predictions = model.predict(test_data)

# Show some results
for prediction, survived in zip(predictions[:10], list(test_data)[0][1][:10]):
  print("Predicted survival: {:.2%}".format(prediction[0]),
        " | Actual outcome: ",
        ("SURVIVED" if bool(survived) else "DIED"))

Predicted survival: 90.76% | Actual outcome: SURVIVED
Predicted survival: 26.12% | Actual outcome: DIED
Predicted survival: 20.50% | Actual outcome: DIED
Predicted survival: 14.20% | Actual outcome: DIED
Predicted survival: 11.52% | Actual outcome: DIED

That is all for loading CSV data into TensorFlow. Hope it was helpful!

You may also be interested in loading a pandas DataFrame into TensorFlow; if so, try the linked post.

Any comments and questions are always welcomed. And claps are loved!

Thank you for your interest!

Have a very pleasant day.
