We have provided many tutorials on how to generate the requirements.txt file for your Python project without environments, how to work with Conda environments, how to work with VS Code and virtual environments, and so on. Today, we will provide an alternative way to generate the requirements.txt file that includes only the libraries we have actually used, in other words, only the libraries that we have imported. We will cover two approaches: the first for when we work with .py files, and the second for when we work with Jupyter notebooks.
Let’s assume that we work on a project called “pipreqs_example“, which contains the .py file with the project’s code. In order to make the project reproducible, we would like to generate the “requirements.txt” file, but ONLY for the libraries actually used. We can easily achieve that with the pipreqs library, which we can install as follows:
pip install pipreqs
Within the project I have a .py file with the following imports:
import numpy as np
import pandas as pd
import re
Let’s see how we can generate the requirements.txt file. We can either specify the whole path of the project, or run the following command within the project path.
pipreqs --force
And the requirements.txt appears in the project directory!
pandas==1.2.5
numpy==1.21.1
Usage:
pipreqs [options] [&lt;path&gt;]

Arguments:
    &lt;path&gt;                The path to the directory containing the application files for which a
                          requirements file should be generated (defaults to the current working directory)

Options:
    --use-local           Use ONLY local package info instead of querying PyPI
    --pypi-server &lt;url&gt;   Use custom PyPi server
    --proxy &lt;url&gt;         Use Proxy, parameter will be passed to requests library. You can also just set the
                          environments parameter in your terminal:
                          $ export HTTP_PROXY="http://10.10.1.10:3128"
                          $ export HTTPS_PROXY="https://10.10.1.10:1080"
    --debug               Print debug information
    --ignore &lt;dirs&gt;...    Ignore extra directories, each separated by a comma
    --no-follow-links     Do not follow symbolic links in the project
    --encoding &lt;charset&gt;  Use encoding parameter for file open
    --savepath &lt;file&gt;     Save the list of requirements in the given file
    --print               Output the list of requirements in the standard output
    --force               Overwrite existing requirements.txt
    --diff &lt;file&gt;         Compare modules in requirements.txt to project imports
    --clean &lt;file&gt;        Clean up requirements.txt by removing modules that are not imported in project
    --mode &lt;scheme&gt;       Enables dynamic versioning with &lt;compat&gt;, &lt;gt&gt; or &lt;non-pin&gt; schemes
                          &lt;compat&gt; | e.g. Flask~=1.1.2
                          &lt;gt&gt;     | e.g. Flask&gt;=1.1.2
                          &lt;no-pin&gt; | e.g. Flask
If you work with Jupyter notebooks, you can use the pipreqsnb library. You can install the library as follows:
pip install pipreqsnb
We work similarly as before, but now the command is:
pipreqsnb --force
Note that pipreqsnb is a simple, fully compatible pipreqs wrapper that supports both Python files and Jupyter notebooks.
Usage:
pipreqsnb [options] &lt;path&gt;

Options:
    --use-local           Use ONLY local package info instead of querying PyPI
    --pypi-server &lt;url&gt;   Use custom PyPi server
    --proxy &lt;url&gt;         Use Proxy, parameter will be passed to requests library. You can also just set the
                          environments parameter in your terminal:
                          $ export HTTP_PROXY="http://10.10.1.10:3128"
                          $ export HTTPS_PROXY="https://10.10.1.10:1080"
    --debug               Print debug information
    --ignore &lt;dirs&gt;...    Ignore extra directories (separated by comma, no space)
    --encoding &lt;charset&gt;  Use encoding parameter for file open
    --savepath &lt;file&gt;     Save the list of requirements in the given file
    --print               Output the list of requirements in the standard output
    --force               Overwrite existing requirements.txt
    --diff &lt;file&gt;         Compare modules in requirements.txt to project imports.
    --clean &lt;file&gt;        Clean up requirements.txt by removing modules that are not imported in project.
    --no-pin              Omit version of output packages.
When you work on projects using environments with many installed libraries that are not used in a particular project, it is better to share a requirements.txt of the used libraries only. A good application of pipreqs is when you work with Jupyter notebooks on AWS SageMaker or with Colab and you just want to know which versions of the libraries you have actually used.
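Under the hood, tools like pipreqs work by scanning your source files for import statements. To make that idea concrete, here is a minimal stdlib-only sketch (the function name is made up for illustration; it is not pipreqs’ API) that collects the top-level module names imported by a piece of Python source:

```python
import ast

def imported_modules(source: str) -> set:
    """Collect top-level module names from import statements in Python source."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import pandas.io" counts as "pandas"
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from os.path import join" counts as "os"
            modules.add(node.module.split(".")[0])
    return modules

code = "import numpy as np\nimport pandas as pd\nimport re\n"
print(imported_modules(code))  # {'numpy', 'pandas', 're'} (in some order)
```

A real tool would additionally filter out standard-library modules (such as re above) and map import names to PyPI package names and versions.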
I am pleased to announce the 0.9.0 release of the data algebra.
The data algebra is a realization of the Codd relational algebra for data, written in terms of Python method chaining. It allows the concise, clear specification of useful data transforms. Some examples can be found here. Benefits include being able to specify a single data transformation that can then be translated and executed in many realizations, currently including Pandas, Google BigQuery, PostgreSQL, Spark, and SQLite. It allows you to rehearse and debug your big data work in memory.
Some notable features of the 0.9.0 PyPI release include a WITH operator for better machine-generated SQL.
We’ve been using the data algebra to speed up development on both client and internal Python data science projects. I invite you to give it a try.
After deploying a Machine Learning model, we need to somehow validate incoming datasets before we move on and feed them into the ML pipeline. We can’t just rely on our sources and take it for granted that the data will be OK. There might be new columns, new values, or even wrong data types, and most of the time the model will ignore them. That means we may end up using an outdated or biased model.
In this post, I will show you a simple and fast way to validate your data using TensorFlow Data Validation. TFDV is a powerful library that can compute descriptive statistics, infer a schema, and detect data anomalies at scale. It is used to analyse and validate petabytes of data at Google every day across thousands of different applications in production.
But first, let’s create some dummy data.
import pandas as pd
import numpy as np
import tensorflow_data_validation as tfdv

df = pd.DataFrame({'Name': np.random.choice(['Billy', 'George'], 100),
                   'Number': np.random.randn(100),
                   'Feature': np.random.choice(['A', 'B'], 100)})
df.head()
Firstly, TFDV will create a “schema” of our original data so we can use it later to validate the new data.
df_stats = tfdv.generate_statistics_from_dataframe(df)
schema = tfdv.infer_schema(df_stats)
schema
feature {
  name: "Name"
  type: BYTES
  domain: "Name"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "Number"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "Feature"
  type: BYTES
  domain: "Feature"
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
string_domain {
  name: "Name"
  value: "Billy"
  value: "George"
}
string_domain {
  name: "Feature"
  value: "A"
  value: "B"
}
As you can see, the schema is a text-format protocol buffer that describes the characteristics of the data. We can display it in a nicer format as follows:
tfdv.display_schema(schema)
The schema can be saved and loaded using the following code.
from tensorflow_data_validation.utils.schema_util import write_schema_text, load_schema_text

# save
write_schema_text(schema, "my_schema")

# load
schema = load_schema_text("my_schema")
Let’s suppose that we created a machine learning model with the data above. Now we will create the hypothetical new data that we want to validate.
test = pd.DataFrame({'Name': {0: 'Guilia', 1: 'Billy', 2: 'George', 3: 'Billy', 4: 'Billy'},
                     'Number': {0: 1, 1: 2, 2: 3, 3: 5, 4: 1},
                     'Feature': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'A'}})
test['Feature2'] = 'YES'
test.head()
Now it is time to validate the new data.
new_stats = tfdv.generate_statistics_from_dataframe(test)
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)
tfdv.display_anomalies(anomalies)
We got all the “anomalies” in the new data: it has a new column, the Number column has the wrong data type, and there is a new value in the Name column. This may indicate data drift, so we may have to retrain our model or apply different data preprocessing. TFDV also has options to update the schema or to ignore some of the anomalies.
Data validation using TFDV is a cost-effective way to validate incoming data. It will parse the new data and report any anomalies, such as missing values, new columns, and new values. It can also help us determine whether there is data drift and prevent us from using an outdated model.
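TFDV does this at petabyte scale, but the core idea of “infer a schema, then diff new data against it” can be shown with a tiny stdlib-only stand-in. The function names below are made up for illustration; they are not TFDV’s API:

```python
def infer_schema(rows):
    """Toy schema: for each column, record the value type and, for strings, the domain."""
    schema = {}
    for row in rows:
        for col, val in row.items():
            info = schema.setdefault(col, {"type": type(val).__name__, "domain": set()})
            if isinstance(val, str):
                info["domain"].add(val)
    return schema

def find_anomalies(rows, schema):
    """Report new columns, type mismatches, and unseen string values."""
    anomalies = []
    for row in rows:
        for col, val in row.items():
            if col not in schema:
                anomalies.append(f"new column: {col}")
            elif type(val).__name__ != schema[col]["type"]:
                anomalies.append(f"wrong type in {col}: {type(val).__name__}")
            elif isinstance(val, str) and val not in schema[col]["domain"]:
                anomalies.append(f"unseen value in {col}: {val}")
    return sorted(set(anomalies))

train = [{"Name": "Billy", "Number": 0.5}, {"Name": "George", "Number": -1.2}]
schema = infer_schema(train)

new = [{"Name": "Guilia", "Number": 1, "Feature2": "YES"}]
print(find_anomalies(new, schema))
# ['new column: Feature2', 'unseen value in Name: Guilia', 'wrong type in Number: int']
```

These are exactly the three kinds of anomalies TFDV flagged for our test dataframe above.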
Data Scientists love to work with Jupyter notebooks. It is possible that on the machine you have to work on, Jupyter Notebook with the Anaconda distribution is not available, and you do not want to install something so heavy for a one-off task. One option could be to work with Colab; another option, which we will discuss right now, is to work with Docker.
Jupyter Docker Stacks are a set of ready-to-run Docker images containing Jupyter applications, enabling us to run Jupyter Notebook and JupyterLab in a local Docker container, as well as JupyterLab servers for a team using JupyterHub. Finally, we can create our own Dockerfiles.
Assume that the data you want to work with are on your local PC and need to be loaded into the Docker container, and moreover, the output of the analysis, which will be generated in the container, needs to be copied back to your local PC. In other words, we need to synchronize the local working environment with the Jupyter Notebook Docker container. So let’s assume that our working directory is C:\ForTheBlog\jupyter_docker (Windows), where the mydata.tsv file is located.
Our goal is to synchronize this folder with the Jupyter Notebook.
Assuming that you have installed Docker, then it requires a single command to synchronize your local directory with the Jupyter Notebook Docker Container.
docker run -it -p 8888:8888 -v //c/ForTheBlog/jupyter_docker:/home/jovyan/work/tmp --rm --name jupyter jupyter/datascience-notebook
Where:

- -it runs the container interactively with a terminal attached.
- -p 8888:8888 maps port 8888 of the container to port 8888 on the local machine, so we can open Jupyter in the browser.
- -v //c/ForTheBlog/jupyter_docker:/home/jovyan/work/tmp mounts the local folder into the container at /home/jovyan/work/tmp.
- --rm removes the container when it stops.
- --name jupyter gives the container a friendly name.
- jupyter/datascience-notebook is the image we run.
As we can see, the mydata.tsv is inside the Docker container.
Let’s create a new file now within the Docker container.
As you can see, we created two new files, the MyNotebook.ipynb and the mydata.csv. Not surprisingly, these two new files are also in our local folder!
If you are done, you can stop the container. First, you need to find the container id and then to stop it.
docker ps -a
docker stop 5ccf
With a single command, you can work with Jupyter notebooks in Docker as if you were working on your local computer.
Tree-based models are probably the second easiest ML technique to explain to a non-data scientist. I am a big fan of tree-based models because of their simplicity and interpretability. But when I try to visualize them, it gets on my nerves: there are so many packages out there for the job. Sklearn has finally provided us with an API to visualize trees through matplotlib. In this tutorial, I will show you how to visualize trees using sklearn for both classification and regression.
The following are the libraries that are required to load datasets, split data, train models and visualize them.
from sklearn.datasets import load_wine, fetch_california_housing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree, DecisionTreeClassifier, DecisionTreeRegressor
In this section, our objective is to train a decision tree classifier on the wine dataset and visualize it.
# load wine data set
data = load_wine()
x = data.data
y = data.target

# split into train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)

# create a decision tree classifier
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(x_train, y_train)

# plot classifier tree
plt.figure(figsize=(10, 8))
plot_tree(clf, feature_names=data.feature_names, class_names=data.target_names, filled=True)
Once you execute the above code, you should have the following or similar decision tree for the wine dataset model.
Similar to classification, in this section, we will train and visualize a model for regression
# load data set
data = fetch_california_housing()
x = data.data
y = data.target

# split into train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)

# create a decision tree regressor
clf = DecisionTreeRegressor(max_depth=2, random_state=0)
clf.fit(x_train, y_train)

# plot tree regressor
plt.figure(figsize=(10, 8))
plot_tree(clf, feature_names=data.feature_names, filled=True)
Once you execute the code above, you should end up with a graph similar to the one below.
As you can see, visualizing a decision tree has become a lot simpler with sklearn. In the past, it would take me about 10 to 15 minutes to write code using two different packages; now it can be done in two lines. I am definitely looking forward to future updates that support random forests and ensemble models.
Thank you for going through this article. Kindly post below if you have any questions or comments.
You can also find code for this on my Github page.
The post Visualizing trees with Sklearn appeared first on Hi! I am Nagdev.
Regression analysis is the process of building a linear or non-linear fit for one or more continuous target variables. That’s right: there can be more than one target variable. Multi-output machine learning problems are more common in classification than in regression. In classification, the categorical target variables are encoded to convert them to multi-output. In my professional experience, about 90% of data science regression problems have a single target variable, and the rest require fitting multiple target variables. Some applications of multi-output target variable problems are in forecasting and predictive maintenance.
In the next couple of sections, let me walk you through, how to solve multi-output regression problems using sklearn.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import RandomForestRegressor
There are a few packages that we load here.
x, y = make_regression(n_targets=3)
Here we are creating a random dataset for a regression problem with three target variables, keeping the rest of the parameters at their defaults. The code below will show the shape of our features and target variables.
x.shape
y.shape
The following block of code will split our features and target variables into train and test sets. Our train set will have 70% of the observations and the test set will have 30%.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)
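Under the hood, a train/test split is essentially a shuffle followed by a slice. A minimal stdlib-only sketch of the idea (the function name is made up; it is not sklearn’s implementation) looks like this:

```python
import random

def simple_train_test_split(x, y, test_size=0.30, seed=42):
    """Shuffle index positions, then slice off the last test_size fraction."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)          # reproducible shuffle, like random_state
    cut = int(len(x) * (1 - test_size))       # boundary between train and test
    train_idx, test_idx = idx[:cut], idx[cut:]
    return ([x[i] for i in train_idx], [x[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])

x = list(range(100))
y = [v * 2 for v in x]
x_tr, x_te, y_tr, y_te = simple_train_test_split(x, y)
print(len(x_tr), len(x_te))  # 70 30
```

Note how x and y are shuffled with the same index permutation, so each feature stays paired with its target.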
Next, we can train our multi-output regression model using the below code.
According to the sklearn package, “This strategy consists of fitting one regressor per target. This is a simple strategy for extending regressors that do not natively support multi-target regression“.
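To make the “one regressor per target” strategy concrete, here is a toy, stdlib-only illustration. The class is hypothetical (sklearn wraps arbitrary regressors; this one fits an independent simple linear regression per target column):

```python
class PerTargetRegressor:
    """Toy one-model-per-target strategy: fit y = a + b*x independently per target."""

    def fit(self, x, y_columns):
        n = len(x)
        mean_x = sum(x) / n
        self.coefs = []
        for ys in y_columns:  # one independent closed-form fit per target
            mean_y = sum(ys) / n
            b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, ys)) \
                / sum((xi - mean_x) ** 2 for xi in x)
            self.coefs.append((mean_y - b * mean_x, b))  # (intercept, slope)
        return self

    def predict(self, x):
        # each prediction is a row with one value per target
        return [[a + b * xi for a, b in self.coefs] for xi in x]

x = [0.0, 1.0, 2.0, 3.0]
targets = [[1.0, 3.0, 5.0, 7.0],     # y1 = 1 + 2x
           [0.0, -1.0, -2.0, -3.0]]  # y2 = -x
model = PerTargetRegressor().fit(x, targets)
print(model.predict([4.0]))  # [[9.0, -4.0]]
```

MultiOutputRegressor does the same thing conceptually, except each per-target model is a clone of whatever estimator you hand it.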
clf = MultiOutputRegressor(RandomForestRegressor(max_depth=2, random_state=0)) clf.fit(x_train, y_train)
The following block of code performs a prediction for the first test observation and calculates the coefficient of determination of the predictions. Since the dataset is randomly created, we cannot expect a good \(R^2\) value.
clf.predict(x_test[[0]])
clf.score(x_test, y_test, sample_weight=None)
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import RandomForestRegressor

# create regression data
x, y = make_regression(n_targets=3)

# split into train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)

# train the model
clf = MultiOutputRegressor(RandomForestRegressor(max_depth=2, random_state=0))
clf.fit(x_train, y_train)

# predictions
clf.predict(x_test)
Finally, we can put all the code together, and as you can see, with a few lines of code one can easily build a multi-output regression model using sklearn. In my next tutorial, I will show you how to do multi-output regression using deep learning and the Keras package.
Hope you enjoyed this tutorial. Feel free to drop the comments about this tutorial.
The post Multi-Output Regression using Sklearn appeared first on Hi! I am Nagdev.
We have provided examples of how to work with conda environments. In this post, we will provide you a walk-through example of how to work with VS Code and virtual environments.
When we work on a Data Science project, which may include a Flask API, it is better to have full control over the libraries used in the project. Moreover, it is more efficient to work with only the necessary libraries. This is because, with virtual environments, the project is reproducible, and we will need to install only the required libraries as stated in the requirements.txt. Finally, it is less risky to mess with your other projects when you work with virtual environments.
For this example, we call our project “venv_example“, and we have created a folder with the same name. Within this folder, we can create a virtual environment called “myvenv” by running the following command:
# Linux
sudo apt-get install python3-venv  # If needed
python3 -m venv myvenv

# macOS
python3 -m venv myvenv

# Windows
python -m venv myvenv
Then, we can open the folder “venv_example” in VS Code using the File > Open Folder command. In VS Code, open the Command Palette (View > Command Palette, or Ctrl+Shift+P), select the Python: Select Interpreter command, and choose the environment we created, “myvenv“:
Then run Terminal: Create New Terminal (Ctrl+Shift+`) from the Command Palette; this opens a new Python terminal and, in parallel, activates the virtual environment.
Confirm that the new environment is selected (hint: look at the blue status bar at the bottom of VS Code) and then update pip in the virtual environment:
python -m pip install --upgrade pip
Finally, let’s install the pandas and flask libraries
python -m pip install flask
python -m pip install pandas
Using the pip freeze command, we can generate the requirements.txt file based on the libraries that we installed in our virtual environment.
In the terminal of the activated virtual environment, we can run:
pip freeze > requirements.txt
As we can see, in our folder there is the requirements.txt file as well as the myvenv folder. Now, anyone can create the same environment by running the pip install -r requirements.txt command to reinstall the packages.
Another way to activate the environment is by running source myvenv/bin/activate (Linux/macOS) or myvenv\Scripts\Activate.ps1 (Windows).
In case you want to remove the environment, you can simply run:
rm -rf myvenv
The following is the same content that I posted on my other blog on this topic in R, but written in Python. While I actually first wrote the code for doing this in Python, I’ll be posting similar verbiage from that blog here.
Often when scraping data, websites will ask the user to enter a postal code to get nearby locations. If you are interested in collecting data on locations for an entire province or all of Canada from such a site, it might be hard to find an easy-to-use list of all postal codes or FSAs in Canada or in a given province. Information on how FSAs work can be found here.
In this blog, I’m going to share a brief snippet of code that you can use to generate Canadian FSAs. While some FSAs generated may not actually exist, if we follow the rules about Canadian postal codes, it serves as a good substitute in lieu of an actual list.
The code is essentially 3 nested for-loops. While many programmers would not advise writing nested for-loops, I find that for this case it is easier to understand and write.
import string

fsa_list = []
alphabet = list(string.ascii_uppercase)

# letters that can start an FSA (one or more per province/territory)
province_alphabet = ["A", "B", "C", "E", "G", "H", "J", "K", "L", "M", "N",
                     "P", "R", "S", "T", "V", "X", "Y"]

# letters that Canada Post does not use; available if you also want to
# restrict the final letter
nonLetters = ["D", "F", "I", "O", "Q", "U"]
second_letter = list(set(string.ascii_uppercase) - set(nonLetters))

for letter1 in province_alphabet:
    for number in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]:
        for letter2 in alphabet:
            fsa_list.append(letter1 + str(number) + letter2)
We now have our simulated FSAs!
import random

random.sample(fsa_list, 10)
['A0I', 'S7H', 'R5O', 'V4T', 'R9G', 'B6W', 'T6D', 'B9O', 'M0B', 'Y7H']
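One thing to note: the code above defines second_letter but loops over the full alphabet, so letters Canada Post avoids can still end up in the last position (as in 'A0I' above). A quick sanity check on the counts, plus an optional post-processing filter for those letters, might look like this (self-contained, so the list is rebuilt here):

```python
import string

non_letters = {"D", "F", "I", "O", "Q", "U"}  # letters Canada Post does not use
province_alphabet = ["A", "B", "C", "E", "G", "H", "J", "K", "L", "M", "N",
                     "P", "R", "S", "T", "V", "X", "Y"]

fsa_list = [letter1 + str(number) + letter2
            for letter1 in province_alphabet
            for number in range(10)
            for letter2 in string.ascii_uppercase]
print(len(fsa_list))  # 18 * 10 * 26 = 4680

# drop codes whose final letter never appears in real postal codes
filtered = [fsa for fsa in fsa_list if fsa[2] not in non_letters]
print(len(filtered))  # 18 * 10 * 20 = 3600
```

Either list works as a stand-in for scraping; the filtered one just wastes fewer requests on codes that cannot exist.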
Be sure to subscribe and never miss an update!
In this article we will explore binomial distribution and binomial test in Python.
To continue following this tutorial we will need the following Python libraries: scipy, numpy, and matplotlib.
If you don’t have them installed, please open “Command Prompt” (on Windows) and install them using the following code:
pip install scipy
pip install numpy
pip install matplotlib
The binomial distribution is one of the most popular distributions in statistics, along with the normal distribution. It is a discrete probability distribution of the number of successes (\(X\)) in a sequence of \(n\) independent experiments. Each experiment has two possible outcomes: success and failure. The success outcome has probability \(p\), and failure has probability \((1-p)\).
Note: an individual experiment is also called a Bernoulli trial, an experiment with exactly two possible outcomes. And binomial distribution for one experiment (\(n=1\)) is also a Bernoulli distribution.
In other words, binomial distribution models the probability of observing either success or failure outcome in an independent experiment that is repeated multiple times.
Let’s say probability of success is equal to:
$$p$$
then probability of failure is equal to:
$$q=1-p$$
So the probability of achieving \(k\) successes and \(n-k\) failures is equal to:
$$p^k \times (1-p)^{n-k}$$
And the number of ways to achieve \(k\) successes is calculated as:
$$\frac{n!}{(n-k)! \times k!}$$
Using the above notation, we can write the probability mass function (the total probability of achieving \(k\) successes in \(n\) experiments):
$$f(k;n,p) = Pr(k;n,p) = Pr(X=k) = \frac{n!}{(n-k)! \times k!} p^k (1-p)^{n-k}$$
Note: probability mass function (pmf) – a function that gives the probability that a discrete random variable is exactly equal to some value.
And the formula for the binomial cumulative probability function is:
$$F(k;n,p) = \sum^{k}_{i=0} \frac{n!}{(n-i)! \times i!} p^i (1-p)^{n-i}$$
Example:
You are rolling a single 6-sided die 12 times, and you want to find out the probability of getting 3 as an outcome 5 times. Here, getting 3 is a success outcome, while getting anything else (1, 2, 4, 5, 6) is a failure outcome. Clearly, on each roll, your probability of getting 3 is \(\frac{1}{6}\).
According to the data we have here, rolling a die 12 times, you should expect to get 3 as an outcome 2 times (\(12 \times \frac{1}{6} = 2\)).
Now, how do we actually calculate the probability of observing 3 as an outcome 5 times?
Using the above formula we can easily solve for it. We have an experiment that is repeated 12 times (\(n\) = 12), the number of successes in question is 5 (\(k\) = 5), and the probability is \(\frac{1}{6}\), or 0.17 rounded (\(p\) = 0.17).
Plugging into the above equation we get:
$$Pr(5;12,0.17) = Pr(X=5) = \frac{12!}{(12-5)! \times 5!} 0.17^5 (1-0.17)^{12-5} = 0.03$$
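The hand calculation above can be cross-checked with nothing but the standard library, since math.comb gives the binomial coefficient (the helper name below is just for this sketch):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# the worked example: five 3s in 12 rolls with p = 0.17
print(round(binom_pmf(5, 12, 0.17), 2))  # 0.03
```

The probabilities over all possible outcome counts (0 through 12) sum to 1, which is a handy sanity check for any PMF.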
and the binomial distribution for such experiment would look like this:
You can clearly see that observing 3 as an outcome is most likely to happen 2 times, and the probability of observing it 5 times is less than 0.05.
Let’s now explore how to create the binomial distribution values and plot it using Python. In this section, we will work with three Python libraries: numpy, matplotlib, and scipy.
We will first import the required modules:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
For the data, we will continue with the example from the previous section, where we roll a die 12 times (\(n\) = 12) and the probability of observing any given number from 1 to 6 is \(\frac{1}{6}\), or 0.17 rounded (\(p\) = 0.17).
Now we will create values for them in Python:
n = 12
p = 0.17
x = np.arange(0, n+1)
where \(x\) is an array of numbers from 0 to 12, representing the number of times any number can be observed.
Using this data we can now calculate the binomial probability mass function. Probability mass function (PMF) is a function that gives the probability that a binomial discrete random variable is exactly equal to some value.
In our example, it will show, for each possible count out of 12 rolls, the probability of observing an outcome that has probability 0.17 that many times.
Construct PMF:
binomial_pmf = binom.pmf(x, n, p)
print(binomial_pmf)
And you should get an array with 13 values (which are the probabilities for our \(x\) values):
[1.06890008e-01 2.62717609e-01 2.95952970e-01 2.02056244e-01
 9.31162813e-02 3.05152151e-02 7.29178834e-03 1.28014184e-03
 1.63873579e-04 1.49175414e-05 9.16620011e-07 3.41348087e-08
 5.82622237e-10]
Now that we have the binomial probability mass function, we can easily visualize it:
plt.plot(x, binomial_pmf, color='blue')
plt.title(f"Binomial Distribution (n={n}, p={p})")
plt.show()
and you should get:
Now, how about trying to interpret what we see?
The graph shows that if we choose any number from 1 to 6 (the die sides) and roll the die 12 times, the most likely number of times to observe it is 2.
In other words, if I choose number 1 and roll the die 12 times, most likely 1 will show up 2 times.
If you ask, what is the probability that 1 will show up 6 times? By looking at the above graph (or the PMF values) you can see that it’s under 0.01, or less than 1%.
The binomial test is a one-sample statistical test that determines whether a dichotomous score comes from a binomial probability distribution.
Using the example from the previous section, let’s reword the question in a way that we can do some hypothesis testing. The following is the situation:
You suspect that a die is biased towards the number 3 (three dots), so you decided to roll it 12 times (\(n\) = 12) and observed a value of 3 (three dots) 5 times (\(k\) = 5). You want to understand whether the die is biased towards number 3 or not (recall that the expected probability of observing 3 is \(\frac{1}{6}\), or 0.17).
Formulating the hypotheses, we have:
$$H_0: \pi \leq \frac{1}{6}$$
$$H_1: \pi > \frac{1}{6}$$
And now calculating the probability:
$$Pr(5;12,0.17) = Pr(X=5) = \frac{12!}{(12-5)! \times 5!} 0.17^5 (1-0.17)^{12-5} = 0.03$$
Here the probability is the \(p\)-value for the significance test. Since 0.03 < 0.05, we reject the null hypothesis and conclude that the die is biased towards the number 3.
Let’s now use Python to do the binomial test for the above example.
It is a very simple, few-line application of the binomtest() function from the scipy library.
Step 1:
Import the function.
from scipy.stats import binomtest
Step 2:
Define the number of successes (\(k\)), the number of trials (\(n\)), and the expected probability of success (\(p\)).
k = 5
n = 12
p = 0.17
Step 3:
Perform the binomial test in Python.
res = binomtest(k, n, p)
print(res.pvalue)
and we should get:
0.03926688770369119
which is the \(p\)-value for the significance test (similar number to the one we got by solving the formula in the previous section).
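As a cross-check, the upper-tail probability (the chance of seeing 5 or more threes in 12 rolls) can be computed with just the standard library; for this particular data set it comes out essentially equal to scipy’s result, since every outcome count below 5 is more probable than 5 itself and therefore contributes nothing extra to the two-sided tally. The helper name is just for this sketch:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# probability of seeing 5 or more threes in 12 rolls (upper tail)
p_value = sum(binom_pmf(k, 12, 0.17) for k in range(5, 13))
print(round(p_value, 4))  # 0.0393
```

Summing the tail of the PMF like this is exactly what an exact binomial test does under the hood.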
Note: by default, the test computed is two-tailed. If you are working with a one-tailed test situation, please refer to the scipy documentation for this function.
In this article we explored binomial distribution and binomial test, as well as how to create and plot binomial distribution in Python, and perform a binomial test in Python.
Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Statistics articles.
The post Binomial Distribution and Binomial Test in Python appeared first on PyShark.
Exploratory data analysis is a very important procedure in Data Science. Whatever we want to do with the data, we have to summarise its main characteristics so we can understand it better. Sometimes this can be hard to do, and often we end up with big, complex outputs. In this post, we will show you 3 ways to perform quick exploratory data analysis with nice, readable output.
We will use the Titanic dataset from Kaggle.
import pandas as pd
import numpy as np

df = pd.read_csv('train.csv')
df.head()
TensorFlow Data Validation (TFDV) is a library for exploring and validating machine learning data. It contains a very useful function that can generate statistics from a data frame with just one line of code.
import tensorflow_data_validation as tfdv

stats = tfdv.generate_statistics_from_dataframe(df)
tfdv.visualize_statistics(stats)
As you can see, we get a nicely formatted summarization of our numeric and categorical features.
Quickda is an amazing library that is capable of producing a professional, interactive HTML output.
import pandas as pd
from quickda.explore_data import *
from quickda.clean_data import *
from quickda.explore_numeric import *
from quickda.explore_categoric import *
from quickda.explore_numeric_categoric import *
from quickda.explore_time_series import *

explore(df, method='profile', report_name='Titanic')
The output is an interactive report that contains many statistics of the data such as a complete variable analysis and the correlation between them. Quickda is a great option when we want to share the analysis with others since we can save it as an HTML file.
The Pandas library may not be fancy, but it is one of the most powerful and useful libraries in data science. We will show you how to get all the information you need for a basic exploratory data analysis. The main advantage of pandas is that it can handle big data where the other libraries can’t.
pd.DataFrame({
    "values": {col: df[col].unique() for col in df},
    "type": {col: df[col].dtype for col in df},
    "unique values": {col: len(df[col].unique()) for col in df},
    "NA values": {col: str(round(100 * sum(df[col].isna()) / len(df), 2)) + '%' for col in df},
    "Duplicated Values": {col: sum(df[col].duplicated()) for col in df},
})
This is just an example of the power of pandas. Of course, you can do many things such as count the values, plot histograms, etc. but sometimes a data frame like the above is the only information we need.
TensorFlow Data Validation and Quickda can automatically give us a great presentation of the characteristics of our data. We encourage you to take a closer look at them, because they are both powerful libraries with many capabilities. However, they can’t handle big data, and it would be overkill to use them when you only want a basic understanding of your data. Pandas, on the other hand, is light and fast and can easily handle big data. You can get great results if you use it correctly.