TensorFlow Machine Learning Cookbook

Working with Data Sources

For most of this book, we will rely on datasets to fit machine learning algorithms. This section explains how to access each of these datasets through TensorFlow and Python.

Getting ready

Some of the datasets that we will use are built into Python libraries, some require a Python script to download, and some must be downloaded manually from the Internet. Almost all of them require an active Internet connection to retrieve the data.

How to do it…

  1. Iris data: This dataset is arguably the most classic dataset used in machine learning and maybe all of statistics. It measures the sepal length, sepal width, petal length, and petal width of three different types of iris flowers: Iris setosa, Iris virginica, and Iris versicolor. There are 150 measurements overall, 50 for each species. To load the dataset in Python, we use scikit-learn's dataset function, as follows:
    from sklearn import datasets
    iris = datasets.load_iris()
    print(len(iris.data))
    150
    print(len(iris.target))
    150
    print(iris.data[0]) # Sepal length, Sepal width, Petal length, Petal width
    [ 5.1 3.5 1.4 0.2]
    print(set(iris.target)) # I. setosa, I. virginica, I. versicolor
    {0, 1, 2}
  2. Birth weight data: The University of Massachusetts at Amherst has compiled many statistical datasets that are of interest (1). One such dataset is a measure of child birth weight and other demographic and medical measurements of the mother and family history. There are 189 observations of 11 variables. Here is how to access the data in Python:
    import requests
    birthdata_url = 'https://www.umass.edu/statdata/statdata/data/lowbwt.dat'
    birth_file = requests.get(birthdata_url)
    birth_data = birth_file.text.split('\r\n')[5:]
    birth_header = [x for x in birth_data[0].split(' ') if len(x)>=1]
    birth_data = [[float(x) for x in y.split(' ') if len(x)>=1] for y in birth_data[1:] if len(y)>=1]
    print(len(birth_data))
    189
    print(len(birth_data[0]))
    11
  3. Boston Housing data: Carnegie Mellon University maintains a library of datasets in their Statlib Library. This data is easily accessible via The University of California at Irvine's Machine-Learning Repository (2). There are 506 observations of house worth along with various demographic data and housing attributes (14 variables). Here is how to access the data in Python:
    import requests
    housing_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
    housing_header = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
    housing_file = requests.get(housing_url)
    housing_data = [[float(x) for x in y.split(' ') if len(x)>=1] for y in housing_file.text.split('\n') if len(y)>=1]
    print(len(housing_data))
    506
    print(len(housing_data[0]))
    14
  4. MNIST handwriting data: MNIST (Modified National Institute of Standards and Technology) is a subset of the larger NIST handwriting database. The MNIST handwriting dataset is hosted on Yann LeCun's website (https://yann.lecun.com/exdb/mnist/). It is a database of 70,000 images of single-digit numbers (0-9), with 60,000 annotated for a training set and 10,000 for a test set. This dataset is used so often in image recognition that TensorFlow provides built-in functions to access it. In machine learning, it is also important to hold out validation data to guard against overfitting. Because of this, TensorFlow sets aside 5,000 images of the training set as a validation set. Here is how to access the data in Python:
    from tensorflow.examples.tutorials.mnist import input_data
    mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
    print(len(mnist.train.images))
    55000
    print(len(mnist.test.images))
    10000
    print(len(mnist.validation.images))
    5000
    print(mnist.train.labels[1,:]) # The label at index 1 is a 3
    [ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
  5. Spam-ham text data: UCI's machine-learning dataset library (2) also holds a spam-ham text message dataset. We can access this .zip file and get the spam-ham text data as follows:
    import requests
    import io
    from zipfile import ZipFile
    zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    r = requests.get(zip_url)
    z = ZipFile(io.BytesIO(r.content))
    file = z.read('SMSSpamCollection')
    text_data = file.decode()
    text_data = text_data.encode('ascii',errors='ignore')
    text_data = text_data.decode().split('\n')
    text_data = [x.split('\t') for x in text_data if len(x)>=1]
    [text_data_target, text_data_train] = [list(x) for x in zip(*text_data)]
    print(len(text_data_train))
    5574
    print(set(text_data_target))
    {'ham', 'spam'}
    print(text_data_train[1])
    Ok lar... Joking wif u oni...
  6. Movie review data: Bo Pang from Cornell has released a movie review dataset that classifies reviews as good or bad (3). You can find the data on the website, http://www.cs.cornell.edu/people/pabo/movie-review-data/. To download, extract, and transform this data, we run the following code:
    import requests
    import io
    import tarfile
    movie_data_url = 'http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz'
    r = requests.get(movie_data_url)
    # Stream data into temp object
    stream_data = io.BytesIO(r.content)
    tmp = io.BytesIO()
    while True:
        s = stream_data.read(16384)
        if not s:
            break
        tmp.write(s)
    stream_data.close()
    tmp.seek(0)
    # Extract tar file
    tar_file = tarfile.open(fileobj=tmp, mode="r:gz")
    pos = tar_file.extractfile('rt-polaritydata/rt-polarity.pos')
    neg = tar_file.extractfile('rt-polaritydata/rt-polarity.neg')
    # Save pos/neg reviews (Also deal with encoding)
    pos_data = []
    for line in pos:
        pos_data.append(line.decode('ISO-8859-1').encode('ascii',errors='ignore').decode())
    neg_data = []
    for line in neg:
        neg_data.append(line.decode('ISO-8859-1').encode('ascii',errors='ignore').decode())
    tar_file.close()
    print(len(pos_data))
    5331
    print(len(neg_data))
    5331
    # Print out first negative review
    print(neg_data[0])
    simplistic , silly and tedious .
  7. CIFAR-10 image data: The Canadian Institute For Advanced Research has released CIFAR-10, a labeled subset of a larger collection of 80 million small color images (each image is scaled to 32x32 pixels). CIFAR-10 has 60,000 images in 10 different target classes (airplane, automobile, bird, and so on). There are 50,000 images in the training set and 10,000 in the test set. Since we will be using this dataset in multiple ways, and because it is one of our larger datasets, we will not run a script each time we need it. To get this dataset, please navigate to http://www.cs.toronto.edu/~kriz/cifar.html, and download the CIFAR-10 dataset. We will address how to use this dataset in the appropriate chapters.
  8. The works of Shakespeare text data: Project Gutenberg (5) is a project that releases electronic versions of free books. They have compiled all of the works of Shakespeare into a single text file. Here is how to access it through Python:
    import requests
    shakespeare_url = 'http://www.gutenberg.org/cache/epub/100/pg100.txt'
    # Get Shakespeare text
    response = requests.get(shakespeare_url)
    shakespeare_file = response.content
    # Decode binary into string
    shakespeare_text = shakespeare_file.decode('utf-8')
    # Drop first few descriptive paragraphs.
    shakespeare_text = shakespeare_text[7675:]
    print(len(shakespeare_text)) # Number of characters
    5582212
  9. English-German sentence translation data: The Tatoeba project (http://tatoeba.org) collects sentence translations in many languages. Their data has been released under the Creative Commons License. From this data, ManyThings.org (http://www.manythings.org) has compiled sentence-to-sentence translations in text files available for download. Here we will use the English-German translation file, but you can change the URL to whatever languages you would like to use:
    import requests
    import io
    from zipfile import ZipFile
    sentence_url = 'http://www.manythings.org/anki/deu-eng.zip'
    r = requests.get(sentence_url)
    z = ZipFile(io.BytesIO(r.content))
    file = z.read('deu.txt')
    # Format Data
    eng_ger_data = file.decode()
    eng_ger_data = eng_ger_data.encode('ascii',errors='ignore')
    eng_ger_data = eng_ger_data.decode().split('\n')
    eng_ger_data = [x.split('\t') for x in eng_ger_data if len(x)>=1]
    [english_sentence, german_sentence] = [list(x) for x in zip(*eng_ger_data)]
    print(len(english_sentence))
    137673
    print(len(german_sentence))
    137673
    print(eng_ger_data[10])
    ['I won!', 'Ich habe gewonnen!']
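
Several of the steps above share one text-parsing pattern: split the downloaded text into lines, then split each line into fields. The following is a minimal sketch of that pattern that runs without a network connection. It is not the book's code: the helper names parse_numeric_rows and parse_labeled_text are hypothetical, and the sample strings are made up for illustration rather than taken from any of the datasets.

```python
def parse_numeric_rows(text):
    """Parse whitespace-delimited numeric text into a list of float rows."""
    # str.split() with no argument splits on any run of whitespace,
    # so empty fields never appear and no length filter is needed.
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]


def parse_labeled_text(text):
    """Parse tab-separated 'label<TAB>message' lines into two lists."""
    # Split only on the first tab so a tab inside a message is preserved.
    rows = [line.split('\t', 1) for line in text.splitlines() if line.strip()]
    labels = [row[0] for row in rows]
    messages = [row[1] for row in rows]
    return labels, messages


# Made-up sample inputs (assumptions, not real dataset rows).
numeric_sample = " 0.5  18.0   2.31\n 0.25  0.0   7.07\n"
labeled_sample = "ham\tSee you at noon\nspam\tYou have won a prize\n"

numeric_rows = parse_numeric_rows(numeric_sample)
labels, messages = parse_labeled_text(labeled_sample)

print(numeric_rows)
print(labels)
print(messages)
```

Using split() without an argument is a design choice that differs from the split(' ') plus length filter used in the steps above; both drop the empty strings produced by repeated spaces.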

How it works…

When it comes time to use one of these datasets in a recipe, we will refer you to this section and assume that the data has been loaded as described in the preceding text. If further data transformation or preprocessing is needed, that code will be provided in the recipe itself.
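
Because later recipes assume the data is already loaded, and most of the downloads need an Internet connection, it can be convenient to keep a local copy after the first download. The following is a minimal sketch under that assumption. The names load_cached and fake_fetch are hypothetical and are not part of the book's code or of any library; the stand-in fetch function replaces a real download so that the example runs offline.

```python
import os
import tempfile


def load_cached(path, fetch):
    """Return the bytes stored at `path`, calling `fetch()` only on a cache miss."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return f.read()
    data = fetch()  # for a real download this could wrap requests.get(url).content
    with open(path, 'wb') as f:
        f.write(data)
    return data


# Demonstration with a stand-in fetch function, so no network is needed.
calls = []


def fake_fetch():
    calls.append(1)  # record that a "download" happened
    return b'example bytes'


cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, 'dataset.bin')

first = load_cached(cache_path, fake_fetch)   # cache miss: calls fake_fetch
second = load_cached(cache_path, fake_fetch)  # cache hit: reads the saved file

print(first == second)
print(len(calls))
```

With the requests library, the fetch argument could be something like lambda: requests.get(housing_url).content, so that the file is downloaded only the first time it is needed.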

See also