[Tensorflow2.0] Load NumPy to Tensorflow

A Ydobon
6 min read · Sep 25, 2019

There will be parts that are revised and elaborated for better understanding; however, I hereby acknowledge that the following post is based on the TensorFlow tutorial provided in:

For more detailed explanations and background knowledge regarding the code and the dataset, you can always consult the link.

I. Introduction

What is NumPy?

NumPy is a fundamental package in the Python language, and most of us are already familiar with it.
The basic idea of NumPy is that it takes a list, or a list of sublists, and treats it as an array.

import numpy as np

a = np.array([1, 2, 3])
a  # returns array([1, 2, 3])

b = np.array([[1, 2, 3], [4, 5, 6]])
b  # returns array([[1, 2, 3], [4, 5, 6]])

You can do array math or indexing, etc.
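For example, with the arrays a and b from above, you can do things like:

a + 10   # array([11, 12, 13]), element-wise math
a * 2    # array([2, 4, 6])
b[0]     # array([1, 2, 3]), the first row of b
b[1, 2]  # 6, the entry in row 1, column 2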

Let’s see what we actually do with NumPy.

In the last post, on loading a CSV file, we used np.array() when normalizing our numeric data. Remember?

  • MEAN = np.array(desc.T['mean'])
    MEAN
    array([29.631, 0.545, 0.38 , 34.385])
  • STD = np.array(desc.T['std'])
    STD
    array([12.512, 1.151, 0.793, 54.598])

Then, we used these arrays to make the later preprocessing steps easier.
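Just as a reminder, here is a minimal sketch of how those two arrays can be used for normalization; the normalize helper is a hypothetical stand-in for what we did in the CSV post:

MEAN = np.array([29.631, 0.545, 0.38, 34.385])
STD = np.array([12.512, 1.151, 0.793, 54.598])

def normalize(numeric_data):
  # Hypothetical helper: standardize each of the four numeric columns.
  return (numeric_data - MEAN) / STD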

I think that is enough of a recap on NumPy. We are preparing a Python series, so I hope you can read it in the near future.

Why do we want to load NumPy data?

Again, if you have read our last post on loading CSV data into TensorFlow, you must have seen these screenshots. TensorFlow has been upgraded, and the ways and sources to load data have been tremendously diversified.

So, we are going to publish a series of posts on loading data, and this week we will be learning NumPy and Pandas.

If you are reading this, I bet you have heard about NumPy and pandas.

import numpy as np
import pandas as pd

They are so commonly used that very often we just put these two lines in the moment we start out. From experience, we know that we will be using them with high probability, and we do not want to mess up our code by importing pandas and NumPy out of nowhere halfway through. I bet some of you have been habitually typing “import numpy as np, import pandas as pd”.
The other way around, we can deduce that a great deal of data out there is in NumPy or Pandas format. In this post, I will be focusing on NumPy data.

*Warning* Compared to my previous posts, this is going to be sooooo short. Are you as thrilled as I am?! :)

II. Set up

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import tensorflow as tf

Again, the technical setup. When are we gonna be free from these lines? Never?

III. Load from .npz file

DATA_URL = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz'
path = tf.keras.utils.get_file('mnist.npz', DATA_URL)

with np.load(path) as data:
  train_examples = data['x_train']
  train_labels = data['y_train']
  test_examples = data['x_test']
  test_labels = data['y_test']

First, .npz is one of the formats in which NumPy data is saved, just as .csv is one of the formats in which a spreadsheet can be saved.
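In case .npz files are new to you, here is a minimal sketch of how one is written and read back; the file name toy.npz and its contents are made up for illustration:

x = np.arange(6).reshape(2, 3)
y = np.array([0, 1])
np.savez('toy.npz', x_train=x, y_train=y)  # save two named arrays in one archive

with np.load('toy.npz') as data:
  print(data['x_train'])  # the 2x3 array we saved
  print(data['y_train'])  # array([0, 1])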

And the classic MNIST dataset is constructed in a NumPy format.
We get the dataset from the URL and SLICE the data into four pieces:
train_examples = data['x_train']
train_labels = data['y_train']
test_examples = data['x_test']
test_labels = data['y_test']

And later on, we will combine them into two pieces: train_dataset and test_dataset.

Side note) Do you remember the MNIST dataset?
The MNIST dataset has 60,000 handwritten digits in the train dataset, and our goal is to correctly assign a digit to each of the remaining 10,000 handwritten digits in the test dataset.
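You can check those counts yourself once the file is loaded:

print(train_examples.shape)  # (60000, 28, 28): 60,000 images of 28x28 pixels
print(train_labels.shape)    # (60000,)
print(test_examples.shape)   # (10000, 28, 28)
print(test_labels.shape)     # (10000,)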

IV. Load NumPy with tf.data.Dataset

Remember that I said the main goal of loading data into TensorFlow is converting any format into a tf.data.Dataset? We will be doing that again.

train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_examples, test_labels))

In section III, we sliced the MNIST dataset into four pieces. Now we want to combine them and simply have a training dataset and a testing dataset.

First, from
train_examples = data['x_train']
train_labels = data['y_train']

we have an array of examples and a corresponding array of labels. We combine them into a tuple (train_examples, train_labels) and pass it to tf.data.Dataset.from_tensor_slices, which actually glues the two ‘sliced tensors’ into one dataset, train_dataset.

Likewise,
test_examples = data['x_test']
test_labels = data['y_test']

from here we create test_dataset.
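To see that the gluing worked, you can peek at a single (example, label) pair:

for example, label in train_dataset.take(1):
  print(example.shape)  # (28, 28): one image tensor
  print(label)          # the digit that image shows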

We might need some analogies here.

Let’s say you are a teacher. You have a 2019FallSemester folder, which is equivalent to the tensor, or the ‘data’ you created with np.load(path) as data:.
You have 4 files (the 4 sliced tensors): practiceexam.doc (train_examples) and practicesol.doc (train_labels), which you are going to hand out to your students so they can practice and check their answers. And you also have final.doc (test_examples) and finalsol.doc (test_labels). You are going to use final.doc as their final exam and finalsol.doc as a reference for your T.A. to check their answers.

Now you want to combine them into 2 files; in that case, we use zip files. First, a student zip file (train_dataset) with practiceexam.doc and practicesol.doc; then, a teacher zip file (test_dataset) with final.doc and finalsol.doc.

When you create a zip file, you just select the files that you want to zip and click ‘zip these files’ in the menu.
Likewise, if making a zip file required TensorFlow code, the teacher would use the following line.
student = tf.data.Dataset.from_tensor_slices((practiceexam, practicesol))

V. Use the Dataset

1 — Shuffle and batch the datasets

BATCH_SIZE = 64
SHUFFLE_BUFFER_SIZE = 100
train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)

The beauty of tf.data is that you can preprocess the datasets very easily.

Here, we are facing something new: SHUFFLE_BUFFER_SIZE.
We already understand very intuitively what shuffle does, thanks to Spotify.

Photo by NeONBRAND on Unsplash

However, it might be the right time for us to understand how the shuffle mechanism actually works.
Let’s say we have 1 million records. Shuffle will ‘shuffle’ all 1 million of them, but shuffling 1 million records at once takes a long time.
In that case, what we do is define a shuffle buffer size, which we can think of as a small bag. We fill the bag with the first 100 records, randomly draw one record out, refill the empty spot with the next record from outside the bag, then draw again, and so on. The shuffling is always done over just 100 records, so it is far more efficient than shuffling 1 million records at once.
When everything is done, the records are grouped into batches according to the defined BATCH_SIZE.
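You can watch the buffer at work on a toy dataset. With a buffer of 3, the first item emitted is always one of the first three elements, which shows that the shuffle only happens inside the small bag:

toy = tf.data.Dataset.range(10)
for batch in toy.shuffle(buffer_size=3).batch(5):
  print(batch.numpy())  # e.g. [1 0 3 2 5] then [4 7 6 9 8]; the order varies per run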

2 — Create and train a model

As usual, this is the part where machine learning actually shows its full capacity, and yet it is very short. So, stay tuned.

[Create a Model]

model = tf.keras.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10, activation='softmax')
])

[Model Compiling]

model.compile(optimizer=tf.keras.optimizers.RMSprop(),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

[Model Fitting]

model.fit(train_dataset, epochs=10)

[Model Evaluation]

model.evaluate(test_dataset)
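model.evaluate returns the loss together with the metrics we declared in compile, so you can unpack and print them like this:

loss, accuracy = model.evaluate(test_dataset)
print('Test accuracy: {:.4f}'.format(accuracy))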

This is all for Load NumPy to Tensorflow.
It was not that complicated, but it will be very helpful later on.

Comments and questions are always welcome, and thank you for your interest and claps!
Hope you have a very peaceful day!
