Data Science Tutorial
- What is Data Science?
- Life Cycle of Data Analytics
- What is Machine Learning?
- Python Tools in Data Science
- Working with Databases
- Data Science using R
- Hierarchical Indexing
- Data Science Using Scikit
- Clustering in Data Science
- Working with Network Data
- What is Plotting?
- String Manipulation
- What is Text Analysis?
Python Tools in Data Science
Python has several libraries that can help you deal with extensive data. They range from smarter data structures and code optimizers to just-in-time compilers. The following is a list of libraries we like to use when confronted with extensive data:
The closer you get to the actual hardware of a computer, the more vital it is for the computer to know what types of data it has to process. For a computer, adding 1 + 1 is different from adding 1.00 + 1.00. The first example consists of integers, the second consists of floats, and different parts of the CPU perform these calculations. In Python you don’t have to specify what data types you’re using, so the Python interpreter has to infer them at runtime. But inferring data types is a slow operation and is partially why Python isn’t one of the fastest languages available. Cython, a superset of Python, solves this problem by letting the programmer declare data types while developing the program. Once the compiler has this information, it runs programs much faster. See http://cython.org/ for more information on Cython.
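To illustrate the idea, here is a minimal sketch of a typed Cython function. The function name is illustrative; the code would go in a `.pyx` file and be compiled before use:

```cython
# fib.pyx — a sketch; compile with cythonize before importing.
# The cdef declarations tell the compiler the exact C types,
# so the loop runs as plain C arithmetic instead of dynamic Python objects.
def fib(int n):
    cdef int i
    cdef long a = 0, b = 1
    for i in range(n):
        a, b = b, a + b
    return a
```

Because `n`, `i`, `a`, and `b` are declared as C integers, the compiler no longer has to infer their types on every iteration.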
Numexpr is at the core of many of the big data packages, as NumPy is for in-memory packages. Numexpr is a numerical expression evaluator for NumPy that can often be many times faster than NumPy itself. To achieve this, it rewrites your expression and compiles it on the fly with an internal just-in-time compiler. See https://github.com/pydata/numexpr for details on Numexpr.
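A short sketch of how this looks in practice, assuming `numexpr` and `numpy` are installed: the whole expression is handed to Numexpr as a string, so it can be evaluated in one cache-friendly pass instead of creating intermediate arrays.

```python
import numpy as np
import numexpr as ne

a = np.arange(1_000_000, dtype=np.float64)
b = np.arange(1_000_000, dtype=np.float64)

# NumPy would build temporaries for 2*a and 3*b before adding them;
# numexpr compiles the expression and evaluates it chunk by chunk.
result = ne.evaluate("2*a + 3*b")
print(result[:5])
```

The result matches what `2*a + 3*b` would produce in plain NumPy, but without the intermediate allocations.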
Numba helps you achieve greater speed by compiling your code right before you execute it, also known as just-in-time compiling. This gives you the advantage of writing high-level code while achieving speeds similar to those of C code. Using Numba is straightforward; see http://numba.pydata.org/.
Bcolz helps you overcome the out-of-memory problem that can occur when using NumPy. It can store and work with arrays in an optimized, compressed form. It not only slims down your data footprint but also uses Numexpr in the background to reduce the work needed when performing calculations on bcolz arrays. See the bcolz documentation for more information.
Blaze is ideal if you want the power of a database backend but prefer the “Pythonic way” of working with data. Blaze will translate your Python code into SQL, but it can handle many more data stores than just relational databases, including CSV files, Spark, and others. Blaze delivers a unified way of working with many databases and data libraries. Blaze is still in development, though, so many features are not implemented yet. See http://blaze.readthedocs.org/en/latest/index.html.
Theano enables you to work directly with the graphical processing unit (GPU) and perform symbolic simplifications whenever possible, and it comes with an excellent just-in-time compiler. On top of that, it’s a great library for dealing with an advanced but useful mathematical concept: tensors. See http://deeplearning.net/software/theano/.
Dask enables you to optimize your flow of calculations and execute them efficiently. It also enables you to distribute calculations. See http://dask.pydata.org/en/latest/.
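A minimal sketch, assuming `dask` is installed: operations on a Dask array build a task graph first, and nothing is computed until you explicitly ask for the result.

```python
import dask.array as da

# a lazy, chunked array: 1000 elements split into chunks of 100
x = da.arange(1000, chunks=100)

total = (x ** 2).sum()   # builds a task graph; no work happens yet
print(total.compute())   # executes the graph, chunk by chunk
```

Because the graph is built before execution, Dask can schedule the chunks in parallel or even across machines.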
These libraries are mostly about using Python itself for data processing (apart from Blaze, which also connects to databases). To achieve high-end performance, you can use Python to communicate with all sorts of databases or other software.