Datascience in Towards Data Science on Medium,

The Essential Guide to R and Python Libraries for Data Visualization

12/16/2024 Jesus Santana

Let’s dive into the most important libraries in R and Python for visualizing data and creating different charts, and look at their pros and cons

Becoming proficient in a programming language is the goal of every aspiring data professional, and reaching a solid level in one of the countless languages is a critical milestone.

For data engineers, SQL is probably the most important language. As a web developer, you need to know JavaScript, HTML, CSS and PHP in your sleep. For data scientists, on the other hand, Python and R are the preferred tools. Both languages have their strengths and weaknesses — and both offer powerful tools and a large community to analyze and visualize data.

If you’re at the very beginning of your data science journey, the choice between R and Python can be overwhelming. But if you want to move into this field in the long term, you will come into contact with both languages sooner or later anyway. Also, if you’re already at university, you probably have courses in both languages.

But let’s dive into the most important libraries in R and Python to visualize data, how creating charts in R and Python is different (with code examples), and what the pros and cons of the two languages are.

Table of Contents
1 — What makes R a must-have? (And essential libraries for visualizations)
2 — Python is everywhere: From data analysis to web development (And essential libraries for visualizations)
3 — Step-by-Step-Guide: Creating plots in R and Python with code examples
4 — Advantages and disadvantages: Comparing R & Python for Data Science
5 — Final Thoughts and where to continue learning

1 — What makes R a must-have?

R was developed in the 1990s at the University of Auckland specifically for statistical analyses and graphical representations. R is particularly well suited for statistical analyses, hypothesis testing, data manipulation and visual representations. If you want to progress into the academic world and participate in research projects, R is a must anyway (especially in areas such as biostatistics, social sciences, psychology & economics). On CRAN (the Comprehensive R Archive Network) you can find thousands of R packages.

Essential R libraries for visualizations that you should know

R has countless libraries. When it comes to data visualization, there are different libraries with their own strengths. Here are 6 important libraries you should definitely know:

1. ggplot2
You need to know this library anyway — it’s the undisputed classic in the R community. You can use it to create user-defined and high-quality visualizations.

2. plotly
With Plotly you can create interactive plots. For example, users can zoom into the diagram or switch between different views. The cool thing is that you can also integrate them into web applications and dashboards.

3. lattice
If you need an alternative to ggplot2, you can use Lattice to create multi-layer plots. ggplot2 is much more common, but lattice scores points because it is less complex for beginners.

4. shiny
If you need a way to present your data in real-time, Shiny is a good choice. You can use it to develop interactive web applications directly in R. It is also possible to integrate visualizations that you have created with ggplot2 or plotly into the dashboards.

5. leaflet
If you want to create interactive geographical maps, it is best to use this library. You can use it to create interactive maps that you can also customize with additional layers, markers and pop-ups.

6. esquisse
I recently discovered esquisse. It is particularly suitable if you want to create a prototype quickly. It is a visual tool that allows you to create ggplot2-based visualizations using drag & drop. This means you can create visualizations without writing a single line of code. You can then export the underlying ggplot2 code to further customize your plots. This library probably deserves its own article…

2 — Python is everywhere: From data analysis to web development

Do you know Monty Python? If not, you should definitely watch some clips from the British comedy group. Python was named after the Monty Python comedy group (not the snake…) and was first released in 1991 (the humor is sometimes a bit dark and takes a bit of getting used to — but it’s definitely a classic):

https://www.youtube.com/watch?v=xxamBlMta94

But back to the topic: the language was designed to be highly readable and have a clear syntax. Python is a language for ‘everything’, so to speak: you can use it in data analysis, but also for machine learning, deep learning or in web development (e.g. with the Django framework).

Similar to R with CRAN, Python uses PyPI (the Python Package Index) as its central repository with a huge number of libraries to install. While R is mainly used in the fields of statistics and research, Python is used in almost all industries. Now that machine learning and big data are becoming increasingly important, Python is becoming even more important, as it is the absolute favorite language for machine learning (with scikit-learn, TensorFlow, Keras & PyTorch).

Essential Python libraries for visualizations that you should know

Here I have put together 8 important libraries that you definitely need to know:

1. matplotlib
Even if you’re a beginner, you’ve almost certainly come across this library before. With this library, you can create a wide range of 2D plots — from simple line charts and histograms to complex subplots. It gives you a lot of control over your plots. For example, if you create a bar chart, you can adjust the axes, colours, fonts or even the width of the bars in the code. However, it is often a bit tedious for complex visualizations — for example, the code becomes longer and more complicated if you want to combine several plots.

2. seaborn
This library is based on matplotlib. It is particularly suitable for statistical visualizations such as heat maps, pair plots and box plots. You can also use seaborn to work directly with your tables (pandas DataFrames) without having to convert the data first. This makes it easy to recognize initial patterns, trends and correlations in your data relatively quickly. The library is particularly useful for exploratory data analysis (EDA). In one of my recent articles ‘Mastering Time Series Data: 9 Essential Steps for Beginners before applying Machine Learning in Python’ you can find some important steps for an EDA.
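As a quick, hedged sketch of how seaborn works directly on a DataFrame (the tiny DataFrame and its values below are invented for illustration):

```python
import pandas as pd
import seaborn as sns

# A small hand-made DataFrame standing in for real data (values are made up)
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5],
    "sepal_width":  [3.5, 3.0, 3.3, 2.7, 3.0, 3.2],
})

# seaborn reads columns by name and labels the axes for you
ax = sns.scatterplot(data=df, x="sepal_length", y="sepal_width")
ax.set_title("Sepal Length vs. Sepal Width")
```

Note that no conversion step is needed: the DataFrame goes straight into the plotting call.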

3. plotly
If you want to create interactive visualizations such as 3D plots, geographical maps or dashboards, use plotly. You can zoom into the diagrams, highlight data points or switch between views. The great thing about the library is that you can also easily integrate the plots into web applications afterwards. If you are just starting out with Python, the plotly.express API is an easy way to get started.

4. pandas
Of course, pandas is not just a visualization library but one for almost any data manipulation task. If you want to visualize data, you can create plots directly from a DataFrame. With the ‘df.plot()’ method you can easily create line, bar or scatter charts. If you want to get a quick and easy insight into your data before you dive deeper into the analysis, pandas is definitely a good choice.
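A minimal sketch of plotting straight from a DataFrame (the data below is invented for illustration):

```python
import pandas as pd

# Invented example data
df = pd.DataFrame({
    "year":  [2020, 2021, 2022, 2023],
    "sales": [100, 150, 130, 170],
})

# df.plot() uses matplotlib under the hood and returns the Axes object,
# so you can keep customizing the chart afterwards
ax = df.plot(x="year", y="sales", kind="bar", title="Sales per Year")
```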

5. bokeh
This library is ideal if you want to create interactive dashboards that can be embedded directly in HTML. bokeh was specially developed for interactive, web-friendly visualizations. The library impresses with its fast rendering times, especially with large data sets.

6. altair
The counterpart of ggplot2 in R, so to speak — the syntax is very similar. Altair is very suitable for exploratory data analysis (EDA) and you can create meaningful plots quickly and easily.

7. holoviews
Do you need to be super fast and don’t want to write a lot of code? Then the best way to create your visualization is with holoviews. You can create interactive plots with minimal code, making the library ideal for prototypes or if you need quick feedback.

8. folium
With folium you can create interactive geographical maps and visualizations. It is based on leaflet.js and allows you to create maps with markers, heatmaps or clusters. For example, you can display data points on a world map or carry out geographical analyses.

R and Python are both widely used and great tools for data science.
Own visualization — Illustrations from unDraw.co

3 — Step-by-Step-Guide: Creating plots in R and Python with code examples

The best way to get started is by opening RStudio or Jupyter Lab and running through the code examples yourself step by step.

Prerequisites to run the visualizations with R

In any case, you must have downloaded R and preferably RStudio. Then we install the libraries we need for the visualizations with the following command:

# Installing Packages for R
install.packages(c("ggplot2", "plotly", "leaflet"))

To make the visualizations easy to reproduce, we use the built-in dataset ‘mtcars’, which contains vehicle data such as horsepower, weight and fuel consumption. We load the dataset with the following command:

# Loading data for R
data(mtcars) # Loads the built-in mtcars dataset into memory

Prerequisites to create the visualizations with Python

Of course you need to have Python installed. I also use JupyterLab for the code examples (if you prefer to work with VS Code, that is also a good alternative). I work with Anaconda (a Python distribution that makes it easier for you to get started). If you are not using Anaconda, you can install the packages below using pip. To ensure there are no conflicts between libraries, I create a separate environment using the following command:

conda create --name NameEnvironment python=3.10

In this article ‘Python Data Analysis Ecosystem — A Beginner’s Roadmap’ you will find more detailed instructions on how to get started.

After we have activated the environment (conda activate NameEnvironment) we install the libraries we need:

# Installing Packages for Python
conda install matplotlib seaborn pandas plotly folium

Once the libraries are installed, you can start JupyterLab by entering ‘jupyter lab’ in the terminal.

Now we load a sample data set that is directly integrated into Pandas. Although the data from this sample dataset is less interesting than real data, it is easier to go through the code examples this way. The iris dataset consists of 150 observations of three iris flower species, each with four characteristics (sepal length, sepal width, petal length, petal width) and the corresponding flower species.

# Python-Visualization
# Importing the libraries
from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Loading the Iris dataset as a pandas DataFrame
iris = load_iris(as_frame=True)
df = iris.frame

# Displaying the first 5 rows of the dataset
print(df.head())

Creating basic plots

Histogram
First we create a histogram with R to visualize the distribution of horsepower (hp): We pass the dataset and column to ‘hist()’. We use ‘main’ and ‘xlab’ to label the diagram and the x-axis. We define the fill color with ‘col’ and the border color with ‘border’.

# R-Visualization
# Histogram for Horsepower (hp) in the mtcars Dataset
hist(mtcars$hp,
     main = "Histogram of Horsepower",
     xlab = "Horsepower",
     col = "skyblue",
     border = "black") # Creates a histogram for the hp column

# Adding gridlines for better readability
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

Using Python, we create a histogram with matplotlib that shows the distribution of ‘sepal length’ in the Iris dataset: We use the bins to divide the data into 10 intervals. When you create a histogram, the bins indicate the level of granularity in the distribution display. Starting with 5 to 10 bins is often a good standard and gives you a solid overview of the distribution of your data. If you choose fewer bins, you will see more general trends but less detail. With more bins, you can recognize details and patterns more precisely, but the chart can appear confusing. Especially if you are working with smaller data sets (<100 values), 5–10 bins usually make sense.

# Python-Visualization
# Histogram for Sepal Length in the Iris Dataset
plt.hist(df['sepal length (cm)'], bins=10, color='skyblue', edgecolor='black') # Creates a histogram for Sepal Length
plt.title('Histogram of Sepal Length') # Adds a title to the plot
plt.xlabel('Sepal Length (cm)') # Labels the x-axis
plt.ylabel('Frequency') # Labels the y-axis

# Add gridlines for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.show() # Displays the plot
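To see how the bin count changes the granularity without drawing anything, you can compute the bin counts directly with NumPy (the sample values below are made up):

```python
import numpy as np

# Made-up sample of 10 values
data = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 8.0])

# Fewer bins show the general trend; more bins show finer detail
counts_5, edges_5 = np.histogram(data, bins=5)
counts_10, edges_10 = np.histogram(data, bins=10)

print(counts_5)   # 5 counts that sum to the number of data points
print(counts_10)  # 10 counts that sum to the same total
```

These are exactly the counts matplotlib would draw as bar heights for the respective bin settings.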

Bar chart
We use a bar chart to compare categorical data. First, we analyze the frequency distribution of cars with different numbers of cylinders in the mtcars dataset in R: Instead of just selecting the data frame and column as with the histogram, we need ‘table()’ to create a table that counts the frequency of each number of cylinders. This helps us understand how frequently each cylinder category appears in the dataset.

# R-Visualization
# Bar chart for cylinders in the mtcars dataset
barplot(table(mtcars$cyl),
        main = "Number of Cylinders in Cars",
        xlab = "Cylinders",
        ylab = "Frequency",
        col = "orange",
        cex.names = 0.8) # Adjusts the axis label size

In Python, we create a bar chart to analyze the frequency of each target class (i.e. flower type) in the Iris dataset: With ‘value_counts()’ on the ‘target’ column we count how often each target class occurs in the DataFrame. Each class represents one of the three iris flower types. We then use ‘plot()’ to specify that a bar chart with orange bars and a black border should be created. This helps us visualize the distribution of flower types in the dataset.

# Python-Visualization
# Bar chart for target classes in the Iris dataset
df['target'].value_counts().plot(kind='bar', color='orange', edgecolor='black') # Creates the bar chart
plt.title('Frequency of Target Classes') # Adds a title to the plot
plt.xlabel('Iris Flower Type') # Labels the x-axis
plt.ylabel('Frequency') # Labels the y-axis
plt.xticks(rotation=0) # Ensures x-axis labels are horizontal
plt.show() # Displays the plot

Scatter plot
We use the scatter diagram to visualize the relationship between two numerical variables. In R, we examine the relationship between engine power (hp) and car weight (wt). Such diagrams help us to recognize possible patterns or correlations between the variables. For example, we could test the hypothesis that heavier cars tend to have more hp. We begin by specifying the two variables we want to compare. With ‘pch=19’ we define the shape of the points — here the points are filled as a circle and with ‘col’ they are displayed in blue. Alternatively, you could use ‘pch=17’ to create a filled triangle or ‘pch=0’ to create an unfilled square for the points:

# R-Visualization
# Scatter plot for Horsepower (hp) vs. Car Weight (wt) in the mtcars dataset
plot(mtcars$hp, mtcars$wt,
     main="Scatter Plot: Horsepower vs. Weight",
     xlab="Horsepower",
     ylab="Weight",
     pch=19,
     col="blue")

To better identify trends, we can add a regression line with the following command:

# R-Visualization
# Adding a regression line
abline(lm(mtcars$wt ~ mtcars$hp), col="darkgreen", lwd=2)

In Python, we visualize the relationship between sepal length and sepal width. For example, we can check the hypothesis that larger sepals are also wider. In the first line, we enter the values for the x-axis and the y-axis and colour the points blue. With ‘alpha=0.7’ we specify that the points should be slightly transparent so that we can easily see the points when they overlap.

# Python-Visualization
# Scatter plot for Sepal Length vs. Sepal Width
plt.scatter(df['sepal length (cm)'], df['sepal width (cm)'], color='blue', alpha=0.7)
plt.title('Scatter Plot: Sepal Length vs. Sepal Width')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.show()

As an extension, for this small dataset we can use NumPy to calculate and plot a regression line. If the dataset were larger, it would be better to use seaborn.

# Python-Visualization
# Calculate data for the regression line
x = df['sepal length (cm)']
y = df['sepal width (cm)']
m, b = np.polyfit(x, y, 1) # Calculates slope (m) and y-intercept (b)

# Scatter plot
plt.scatter(x, y, color='blue', alpha=0.7, s=50) # Add point size for better visibility
plt.plot(x, m*x + b, color='red', linewidth=2, label='Regression Line') # Add regression line
plt.title('Scatter Plot: Sepal Length vs. Sepal Width')
plt.xlabel('Sepal Length (cm)', fontsize=12)
plt.ylabel('Sepal Width (cm)', fontsize=12)
plt.legend() # Adds a legend to distinguish the regression line
plt.show()

Interactive plots with a geographical map

Finally, we want to visualize an interactive geographical map that marks locations you have already visited, for example. In R we use leaflet for this: First, we create an empty map object with ‘leaflet()’. With ‘addTiles()’ we add the default map tiles. And with ‘addMarkers()’ we add markers of the locations on the map.

# R-Visualization
library(leaflet)

# Interactive map with Leaflet
leaflet() %>%
  addTiles() %>%
  addMarkers(lng=-0.1278, lat=51.5074, popup="London") %>%
  addMarkers(lng=2.3522, lat=48.8566, popup="Paris")

In Python, we use the library folium: A map is created with ‘folium.Map()’. ‘location’ specifies the starting position of the map. ‘zoom_start’ sets an initial zoom level so that we start with an overview of Europe. We then add the markers with ‘folium.Marker()’. And at the end we save the map as an HTML file that we can open in a browser.

# Python-Visualization
import folium

# Locations and their coordinates
locations = [
    {"name": "London", "coords": [51.5074, -0.1278]},
    {"name": "Paris", "coords": [48.8566, 2.3522]}
]

# Interactive map with Folium
m = folium.Map(location=[51.5074, -0.1278], zoom_start=5) # Initial view set to London

# Add markers for each location
for location in locations:
    folium.Marker(location["coords"], popup=location["name"]).add_to(m)

# Save the map as an HTML file
m.save('map.html')

4 — Advantages and disadvantages: Comparing R & Python for Data Science

R was developed specifically for statistics and data analysis, and you can see this when using it.

But what are the strengths and weaknesses of this programming language?

Advantages of R

  • Strength in statistics & data analysis
    R is specially optimized for statistical operations and visualizations. This is also quickly noticeable when using the language. R also offers extensive functions for regressions, hypothesis tests and data modeling.
  • Community & Open Source
    R has a very active community that is constantly developing new packages and through which you can find many resources on the Internet. In addition, the language is open source and therefore free and accessible to all.
  • Integration into other environments
    You can easily integrate R into other environments. For example, you can integrate R into Jupyter Notebooks for interactive analyses, use Shiny dashboards to display results, or use packages such as ‘httr’ and ‘jsonlite’ to work with APIs. R also allows you to access relational databases directly through packages like ‘DBI’ and ‘RPostgreSQL’.

And what are the disadvantages of R?

  • Steeper learning curve
    Compared to Python, the syntax can be more difficult for beginners. In addition, the error messages are often less intuitive. An example of less intuitive error messages in R is the common ‘object not found’ error. This message appears when you reference a variable or object that doesn’t exist in your environment, often due to a simple typo or forgetting to define the variable.
  • Performance
    R can be slower than Python when processing large datasets. R is also less suitable for machine learning and deep learning: While R offers some support for machine learning through libraries like ‘caret’ or ‘randomForest’, it is not as comprehensive as Python's frameworks like TensorFlow and PyTorch for deep learning.
  • Less broad
    R is primarily specialized for use in statistics. For other tasks such as web development or machine learning, R cannot be used to the same extent as Python.

Python is one of the most versatile and widely used programming languages. But even Python is not equally suitable for everything:

Let’s start with the advantages of Python:

  • Easy to learn
    Python has a very intuitive & easy to understand syntax that has many similarities to English. If you are a beginner, Python is usually recommended. For example, consider the following Python code for a conditional statement:
    if age >= 18:
        print("You are an adult.")
    else:
        print("You are not an adult.")
  • Versatility
    The great thing about this language is that it is not only suitable for data science. Once you have mastered the language, you can also use it for automation, web development and, of course, machine learning.
  • Large community
    Like R, Python has a huge community for libraries and resources. Python is also open-source and therefore freely available to everyone.
  • Performance
    With libraries such as NumPy, Pandas and Dask, Python can process large amounts of data efficiently. Check out one of my articles ‘A Practical Dive into NumPy and Pandas: Data Analysis with Python’ for an overview of NumPy and Pandas.
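As a small illustration of what efficient processing means here, compare a plain Python loop with the vectorized NumPy equivalent (exact timings depend on your machine, so treat this as a sketch):

```python
import time
import numpy as np

arr = np.arange(1_000_000)

# Plain Python loop: one iteration per element
start = time.perf_counter()
total_loop = 0
for x in arr:
    total_loop += x * 2
loop_seconds = time.perf_counter() - start

# Vectorized NumPy version: a single call replaces a million iterations
start = time.perf_counter()
total_vec = int((arr * 2).sum())
vec_seconds = time.perf_counter() - start

print(total_loop == total_vec)  # same result; the NumPy version is usually much faster
```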

And what are the disadvantages of Python?

  • Not as specialized in statistics
    If you want to perform complex statistical analyses and your focus is clearly on statistics, R is probably the better choice.
  • Performance
    If you compare Python with other programming languages such as C++ or Java, which are compiled languages, Python can be slower. You can minimize this disadvantage with libraries such as NumPy or Cython.

5 — Final Thoughts

When to use R? When to use Python?

If you are new to programming or are looking for a versatile language, Python is likely the best choice. R might be the better option if your primary focus is on statistics and data visualization. However, both programming languages are a good choice.

But almost more important than choosing between R and Python is that you understand the basic principles of data analysis: How can you clean raw data and prepare it for analysis? What steps do you need to take to perform exploratory data analysis (EDA) to recognize patterns and relationships in your data? Which visualizations are best suited to show your results to others?

In addition to R and Python, there are other languages and tools that are important in data analysis and visualization. One of them is Julia, which is particularly fast and efficient for numerical calculations and scientific computing. There is also MATLAB, which has powerful visualization and calculation functions. It’s commonly used in academia and engineering for its robust computational capabilities and ease of use in specific domains. However, it’s relatively expensive and less flexible. Tableau and Power BI are excellent tools for creating interactive visualizations without requiring programming skills — and are widely used in business environments. And of course, there is still Excel, which allows practically anyone to create many visualizations very easily without having to know a programming language. While Excel is an excellent tool for beginners, its limitations become apparent when handling larger datasets.

Where can you continue learning?





