    Files

    A file is a non-volatile container for long-term data storage. A typical sequence of file operations consists of opening the file, reading data from it or writing data into it, and closing it. We can open a file for reading (the default mode, denoted as “r”), [over]writing (“w”), or appending (“a”). Opening a file for writing destroys the existing content of the file without notice, and opening a non-existent file for reading raises an exception:

    f = open(name, mode="r")
    «read the file»
    f.close()

    Python provides a safer alternative to this paradigm: the with statement still opens a file explicitly, but it lets Python close the file automatically when the block is exited, saving us from keeping track of open files.

    with open(name, mode="r") as f:
        «read the file»

    Some modules, such as pickle, require that a file be opened in binary mode (“rb”, “wb”, or “ab”). You should also use binary mode for reading and writing raw binary arrays. The following functions read text data from a previously opened file f:
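    For instance, here is a minimal sketch of a binary-mode round trip with pickle (the file name data.pkl is hypothetical):

    import pickle

    # Serialize an object into a file opened in binary mode ("wb")
    with open("data.pkl", "wb") as out_bin:
        pickle.dump({"answer": 42}, out_bin)

    # Deserialize the object back from binary mode ("rb")
    with open("data.pkl", "rb") as in_bin:
        restored = pickle.load(in_bin)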

    f.read() # Read all remaining data as a string (or as bytes in binary mode)

    f.read(n) # Read at most the next n characters (or bytes) as a string

    f.readline() # Read the next line as a string

    f.readlines() # Read all remaining lines as a list of strings

    We can mix and match these functions, as needed. For example, you can read the first line, then the next five bytes, then the next line, and finally the rest of the file. The newline character is not removed from the results returned by any of these functions. Generally, it is unsafe to use the functions read() and readlines() unless you can assume that the file size is reasonably small.
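    A minimal sketch of such mixing, assuming a small text file named example.txt (a hypothetical name):

    with open("example.txt") as f:
        first_line = f.readline()  # The next line, newline included
        chunk = f.read(5)          # The next five characters
        rest = f.read()            # Everything that remains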

    The following functions write text data to a previously opened file f:

    f.write(line) # Write a string (or bytes in binary mode)

    f.writelines(lines) # Write a list of strings

    These functions don’t add a newline character at the end of the written strings; adding it is your responsibility.
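    For example, a minimal sketch that supplies the missing newlines itself (output.txt is a hypothetical file name):

    lines = ["first line", "second line"]
    with open("output.txt", "w") as f:
        f.writelines(line + "\n" for line in lines)  # Add the newlines ourselves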

    Web Data

    According to WorldWideWebSize, the indexed web contains at least 4.85 billion pages. The module urllib.request has functions for downloading data from the web. While it may be feasible (though not advisable) to download a single data set by hand, save it into the cache directory, and then analyze it using Python scripts, some data analysis projects call for automated iterative or recursive downloads.

    The first step toward getting anything off the web is to open the URL with the function urlopen(url) and obtain the open URL handle. Once opened, the URL handle is similar to a read-only open file handle: you can use the functions read(), readline(), and readlines() to access the data.

    Due to the dynamic nature of the web and the Internet, the likelihood of failing to open a URL is higher than that of failing to open a local disk file. Remember to enclose any call to a web-related function in an exception-handling statement:

    import sys
    import urllib.error
    import urllib.request

    url = "http://www.networksciencelab.com"
    try:
        with urllib.request.urlopen(url) as doc:
            html = doc.read()
        # If reading was successful, the connection is closed automatically
    except urllib.error.URLError:
        print("Could not open %s" % url, file=sys.stderr)
        # Do not pretend that the document has been read!
        # Execute an error handler here

    If the data set of interest is deployed at a website that requires authentication, urlopen() will not work. Instead, use a module that provides Secure Sockets Layer (SSL; for example, OpenSSL).

    The module urllib.parse supplies friendly tools for parsing and unparsing (building) URLs. The function urlparse() splits a URL into a tuple of six elements: scheme (such as http), network address, file system path, parameters, query, and fragment.

    The function urlunparse(parts) constructs a valid URL from the parts returned by urlparse(). If you parse a URL and then unparse it again, the result may be slightly different from the original URL, but the two are functionally equivalent.
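    A minimal sketch of the round trip (the URL is used only for illustration):

    from urllib.parse import urlparse, urlunparse

    parts = urlparse("http://www.networksciencelab.com/index.html;params?q=data#top")
    print(parts.scheme)       # "http"
    print(parts.netloc)       # "www.networksciencelab.com"
    print(parts.path)         # "/index.html"
    print(urlunparse(parts))  # A functionally equivalent URL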

    CSV Files

    CSV is a structured text file format used to store and move tabular or nearly tabular data. It dates back to 1972 and is a format of choice for Microsoft Excel, Apache OpenOffice Calc, and other spreadsheet software. Data.gov, a U.S. government website that provides access to publicly available data, alone provides 12,550 data sets in the CSV format.

    A CSV file consists of columns representing variables and rows representing records. (Data scientists with a statistical background often call them observations.) Commas typically separate the fields in a record, but other delimiters, such as tabs (tab-separated values [TSV]), colons, semicolons, and vertical bars, are also common. Stick to commas when you write your files, but be prepared to face other separators in files written by those who don’t follow this advice.

    The Python module csv provides a CSV reader and a CSV writer. Both objects take a previously opened text file handle as the first parameter (in the example below, the file is opened with the newline="" option to avoid the need to strip the lines). You may provide the delimiter and the quote character, if needed, through the optional parameters delimiter and quotechar. Other optional parameters control the escape character, the line terminator, and so on.

    import csv

    with open("somefile.csv", newline="") as infile:
        reader = csv.reader(infile, delimiter=",", quotechar='"')

    The first record of a CSV file often contains column headers and may be treated differently from the rest of the file. This is not a feature of the CSV format itself, but simply common practice.

    A CSV reader provides an iterator interface for use in a for loop. The iterator returns the next record as a list of string fields. The reader doesn’t convert the fields to any numeric data type (that’s still our job!) and doesn’t strip them of leading whitespace unless instructed to by passing the optional parameter skipinitialspace=True. If the size of the CSV file is not known and is potentially large, you don’t want to read all records at once. Instead, use incremental, iterative, row-by-row processing: read a row, process it, discard it, and then get the next one.
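    A minimal sketch of such row-by-row processing (process() is a hypothetical handler defined elsewhere, not a library function):

    import csv

    with open("somefile.csv", newline="") as infile:
        reader = csv.reader(infile, skipinitialspace=True)
        for row in reader:  # One record at a time; the whole file is never in memory
            process(row)    # row is a list of string fields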

    A CSV writer provides the functions writerow() and writerows(). writerow() writes a sequence of strings or numbers into the file as one record. The numbers are converted to strings, so you have one less thing to worry about. In a similar spirit, writerows() writes a list of sequences of strings or numbers into the file as a collection of records.
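    A minimal sketch of both calls (out.csv is a hypothetical file name):

    import csv

    with open("out.csv", "w", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["name", "age"])                # Write one record
        writer.writerows([["alice", 37], ["bob", 42]])  # Write several records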

    In the following example, we’ll use the csv module to extract the “Answer.Age” column from a CSV file. We’ll assume that the index of the column is not known, but that the column exists. Once we get the numbers, we’ll compute the mean and the standard deviation of the age variable with a little help from the module statistics.

    First, open the file and read the data:

    import csv
    import statistics

    with open("demographics.csv", newline="") as infile:
        data = list(csv.reader(infile))

    data[0] is the first record in the file; it must contain the column header of interest:

    ageIndex = data[0].index(“Answer.Age”)

    Finally, access the field of interest in the remaining records and calculate and display the statistics:

    ages = [int(row[ageIndex]) for row in data[1:]]

    print(statistics.mean(ages), statistics.stdev(ages))

    JSON Files

    JSON (JavaScript Object Notation) is a lightweight data-interchange format. JSON is language-independent but more restricted than pickle in terms of data representation.

    JSON supports the following data types:

    Atomic data types:

    strings, numbers, true, false, null.

    Arrays:

    An array corresponds to a Python list; it’s enclosed in square brackets []; the items in an array don’t have to be of the same data type: [1, 3.14, "a string", true, null]

    Objects:

    An object corresponds to a Python dictionary; it is enclosed in curly braces {}; every item consists of a key and a value, separated by a colon: {"age": 37, "gender": "male", "married": true}

    Any recursive combinations of arrays, objects, and atomic data types (arrays of objects, objects with arrays as item values, and so on).
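    For example, a hypothetical record combining all three kinds:

    {"name": "alice", "grades": [90, 85, 100], "address": {"city": "Paris", "zip": null}}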

    Storing complex data into a JSON file is called serialization. The opposite operation is called deserialization. Python handles JSON serialization and deserialization via the functions in the module json.

    The function dump() exports (“dumps”) a representable Python object to a previously opened text file. The function dumps() exports a representable Python object to a text string (for pretty-printing or interprocess communications). Both functions are responsible for serialization.

    The function loads() converts a valid JSON string into a Python object (it “loads” the object into Python). This conversion is always possible. In the same spirit, the function load() converts the content of a previously opened text file into one Python object. It is an error to store more than one object in a JSON file. However, if an existing file contains more than one object, you can read it as text, convert the text into an array of objects (by adding square brackets around the text and comma separators between the individual objects), and use loads() to deserialize the text to a list of objects.
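    A minimal sketch of this repair, assuming a hypothetical file multi.json that stores one JSON object per line:

    import json

    with open("multi.json") as infile:
        text = infile.read()

    # Bracket the objects and separate them with commas, then parse the result
    objects = json.loads("[" + ",".join(line for line in text.splitlines() if line.strip()) + "]")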

    The following code fragment subjects an arbitrary (but serializable) object to a sequence of serializations and deserializations:

    import json

    object = «some serializable object»

    # Save an object to a file
    with open("data.json", "w") as out_json:
        json.dump(object, out_json, indent=None, sort_keys=False)

    # Load an object from a file
    with open("data.json") as in_json:
        object1 = json.load(in_json)

    # Serialize an object to a string
    json_string = json.dumps(object1)

    # Parse a string as JSON
    object2 = json.loads(json_string)
