A blog about nothing

A collection of useless things that may turn out to be useful

Easiest way to install xgboost on Windows (download binaries – no need to compile)

Posted by Diego on April 5, 2017

1) I am assuming both git and Anaconda are already installed.

2) Choose a place for the installer files and clone the git repo:


git clone --recursive https://github.com/dmlc/xgboost xgboost_install




3) Download the libxgboost.dll file from here and copy it to the xgboost folder at <install_dir>\python-package\xgboost\





4) Navigate to the python-package folder and run:

python setup.py install


That should work fine.

If, however, you get the error below – like I did – when trying to import the library:



WindowsError: [Error 126] The specified module could not be found


here’s what I recommend:

After some debugging I found out the problem was in the from .core import DMatrix, Booster command – more specifically, in the “_load_lib()” function inside core, which tries to run this line:


lib = ctypes.cdll.LoadLibrary(lib_path[0])


where lib_path[0] was precisely the file path of the libxgboost.dll I had just copied to the xgboost folder.

Since I was sure the file existed, I realized that the DLL might depend on other DLLs that could not be found. To check that, I downloaded Dependency Walker from this link, which showed me that the required VCOMP140.DLL was missing:







After some googling, I discovered that the most common cause for that is a machine without the Visual C++ runtime installed. I downloaded it from here, which eventually solved my problem:
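Before reaching for Dependency Walker, a quick sanity check is to reproduce the failing call directly. Here is a small Python sketch (the DLL path is hypothetical) – note that ctypes raises the same "module could not be found" error both when the DLL itself is missing and when one of its dependencies is missing, which is exactly what makes this case misleading:

```python
import ctypes

def try_load(path):
    # Mirrors the call inside xgboost's _load_lib(): a failure here can mean
    # the file is absent OR one of its dependent DLLs (e.g. VCOMP140.DLL)
    # cannot be found.
    try:
        ctypes.cdll.LoadLibrary(path)
        return True
    except OSError:
        return False

# Hypothetical install location:
print(try_load(r"C:\xgboost\python-package\xgboost\libxgboost.dll"))
```

If this returns False even though the file exists on disk, suspect a missing dependency rather than a missing file.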



Posted in Data Science, Machine Learning, Python, xgboost

Scripting body and signature of Functions (Actian Matrix \ Redshift)

Posted by Diego on February 19, 2017


Useful if you want to automatically manage permissions on functions, since you have to include the function signature in the grant/revoke statement.


SELECT proname,
       n.nspname||'.'||p.proname||'('||pg_catalog.oidvectortypes(p.proargtypes)||')' AS signature,
       prosrc AS body
FROM pg_catalog.pg_namespace n
JOIN pg_catalog.pg_proc p ON pronamespace = n.oid;
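Once you have the signature column, building the grant statement is just string concatenation. A hypothetical sketch in Python (the function name and role below are made up for illustration):

```python
def grant_execute(signature, role):
    # `signature` is exactly what the query's signature column returns,
    # e.g. "public.my_func(int4, varchar)"
    return "GRANT EXECUTE ON FUNCTION {0} TO {1};".format(signature, role)

print(grant_execute("public.my_func(int4, varchar)", "reporting_role"))
# GRANT EXECUTE ON FUNCTION public.my_func(int4, varchar) TO reporting_role;
```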




Posted in Actian Matrix \ ParAccel, I.T., SQL

How to move a VM file to another location

Posted by Diego on November 25, 2016

1) Copy the "VirtualBox VMs" folder from its current location to the new location you desire;

2) Change the "Default Machine Folder" to the new location (go to File -> Preferences -> General):

3) On the VirtualBox Manager, right-click your VM and click "Remove" -> "Remove only";

4) Close and then reopen the VirtualBox Manager;

5) Go to "Machine" -> "Add" (it should default to the new folder) and re-add the VM.

Posted in Uncategorized

How to connect to PostgreSQL running on an Ubuntu VM

Posted by Diego on September 19, 2016


FYI, PostgreSQL can be installed using:



sudo apt-get update
sudo apt-get install postgresql postgresql-contrib



Optionally, install pgAdmin III to test the connectivity.


Virtual Box creates virtual machines with the NAT network type by default. If you want to run server software inside a virtual machine, you’ll need to change its network type or forward ports through the virtual NAT.

With the NAT network type, your host operating system performs network address translation. The virtual machine shares your host computer’s IP address and won’t receive any incoming traffic. You can use bridged networking mode instead — in bridged mode, the virtual machine will appear as a separate device on your network and have its own IP address.

To change a virtual machine’s network type in VirtualBox, right-click a virtual machine and select Settings, go to “network” and change the “attached to” option to “Bridged Adapter”




You can check your VM’s new IP by typing “ifconfig” or clicking on the top right corner icon -> “System Settings” -> “Network”:





Then, navigate to Postgres's installation folder (normally /etc/postgresql/9.5/main) and edit the postgresql.conf file, setting the listen_addresses parameter to whatever suits you (I set it to '*', i.e. all interfaces):



sudo vi postgresql.conf



sudo systemctl restart postgresql  #restart


That will make PostgreSQL listen on all IPs, as we can see in this before/after:



(That will allow you to connect locally using the server's IP address; before this step you'd be able to connect only using "localhost".)

At last, edit the pg_hba.conf file, which controls client authentication, and add a row that allows all connections:



sudo vi pg_hba.conf




sudo systemctl restart postgresql
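For reference, here are the two edits in one place. These are the permissive values I used; you should tighten the ADDRESS range for anything beyond a local test VM:

```
# postgresql.conf
listen_addresses = '*'

# pg_hba.conf  (TYPE  DATABASE  USER  ADDRESS  METHOD)
host    all    all    0.0.0.0/0    md5
```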



By doing so, you should be able to access PostgreSQL from outside your VM.

Posted in PSQL

TensorFlow working on an Ubuntu virtual Machine

Posted by Diego on August 5, 2016


This is a quick step-by-step on how to get TensorFlow working on an Ubuntu virtual machine. The initial steps are very high level because it is not hard to find tutorials or documentation that support them.

1) Download and install Oracle VirtualBox from


2) Download an Ubuntu image from

3) Create a new VM (make sure to always run "as administrator" – you may get errors otherwise):


4) Once the VM is up and running, install software to help development (type CTRL + ALT + T – the shortcut to open a new terminal):



Guest Additions:

- sudo apt-get install virtualbox-guest-utils virtualbox-guest-x11 virtualbox-guest-dkms

- Download from:

- Run: sudo dpkg -i /home/diego/Downloads/<DownloadedFile>

- sudo apt-get install -f

- sudo apt-get install htop

Anaconda (Jupyter notebooks – optional, if you want to run the Udacity examples):

- bash

- Restart the VM or type ". .bashrc" (two dots, without the quotes) in your home directory to update the PATH

- Type jupyter notebook to start it


If Anaconda is not installed (probably using /usr/bin/python):

- Run: sudo apt-get install python-pip python-dev

- # For Ubuntu/Linux 64-bit, CPU only, Python 2.7

- $ export TF_BINARY_URL=

- # Python 2: $ sudo pip install --upgrade $TF_BINARY_URL


If Anaconda is installed (probably using /home/<user>/anaconda/…):

- # Python 2.7: conda create -n tensorflow python=2.7

- conda install -c conda-forge tensorflow







Mounting a network driver:


* Mount:

Go to the settings on your VM and add a folder:



On Unix run:

sudo mount -t vboxsf TensorflowVMShared /home/diego/Documents/myMountedFolder



TensorflowVMShared is the Windows alias you created, and

/home/diego/Documents/myMountedFolder is the folder on Unix.



* See a list of mounts:

cat /proc/mounts


df -aTh


* Remove the mounts:

sudo umount -f /home/diego/Documents/myMountedFolder




$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

Download and unpack Pycharm (move it to the desired folder, like “opt”)

Run from the bin subdirectory


Creating a launcher icon:

Posted in Data Science, Deep Learning, Machine Learning

How to install PyGame using Anaconda

Posted by Diego on June 14, 2016


Search the package using:

binstar search -t conda pygame





Install the package for the desired platform (substitute the channel shown by the search):


conda install -c <channel> pygame




Posted in Python

Docker Containers on Windows

Posted by Diego on February 23, 2016

This post is a follow-up on Kaggle's "How to get started with data science in containers" post.
I was having some trouble setting up the environment on Windows, so I decided to document my steps/discoveries.
I also (like to think that I) improved the provided script a little by preventing multiple containers from being created.


As the post above indicates, the first thing you have to do is to install Docker:

Because the Docker Engine daemon uses Linux-specific kernel features, you can’t run Docker Engine natively in Windows. Instead, you must use the Docker Machine command, docker-machine, to create and attach to a small Linux VM on your machine.

The installer is really straightforward and, after it completes, it leaves a link to the Quick Start Terminal on your desktop. When you run it, the "default" machine will be started but, as mentioned in Kaggle's post, "it's quite small and struggles to handle a typical data science stack", so the post walks you through creating a new VM called "docker2". If you followed the steps, you will be left with two VMs:




To avoid having to run

docker-machine start docker2
eval $(docker-machine env docker2)

every time you launch your CLI in order to switch VMs, you can edit the script the terminal shortcut calls and change the VM's name so it starts the docker2 VM instead:






When you log in you should be in your user folder "c:\users\<your_user_name>".

You can see that I already have Kaggle’s python image ready to be used and that I have no containers running at the moment:





To make things simpler (and to avoid seeing all the files we normally have in our user folder), I created and will be working inside the _dockerjupyter folder.

Inside that folder, I created a script called StartJupyter, which is based on the one found on the tutorial, but with a few modifications.

There is of course room for improvement in this script – like parameterizing the container name, the port and maybe the VM name (in case you chose anything other than "docker2").



RUNNING=$(docker inspect -f {{.State.Running}} KagglePythonContainer 2> /dev/null) #discard the error if KagglePythonContainer doesn't exist yet (first run)

if [ "$RUNNING" == "" ]; then
	echo "KagglePythonContainer does not exist - Creating"
	(sleep 3 && start "http://$(docker-machine ip docker2):8888")&
	docker run -d -v $PWD:/tmp/dockerwd -w=/tmp/dockerwd -p 8888:8888 --name KagglePythonContainer -it kaggle/python jupyter notebook --no-browser --ip="" --notebook-dir=/tmp/dockerwd
elif [ "$RUNNING" == "true" ]; then
	echo "Container already running"
else
	echo "Starting Container"
	docker start KagglePythonContainer
fi

#Launch URL
start "http://$(docker-machine ip docker2):8888"


The two main modifications are the "-d" flag, which runs the container in the background (detached mode), and the "--name", which I use to control the container's state and existence. This is important to avoid creating more than one container, which would cause a conflict (due to the port assignment) and leave the new container stopped. This is what would happen:
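The script's branching boils down to three cases, which can be sketched as a pure function (Python, function name mine):

```python
def container_action(running):
    # `running` is the output of
    # docker inspect -f {{.State.Running}} KagglePythonContainer
    # ""      -> inspect failed: the container doesn't exist yet, create it
    # "true"  -> already running: nothing to do
    # "false" -> exists but stopped: just start it
    if running == "":
        return "create"
    elif running == "true":
        return "none"
    return "start"

print(container_action(""), container_action("true"), container_action("false"))
```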



The "-v $PWD:/tmp/dockerwd" will map your current working directory to "/tmp/dockerwd" inside the container, so Jupyter's initial page will show the content of the folder you are in.

Running the code will create and start the container in detached mode:




It will also open the browser on your current working directory where you’ll probably only see the StartJupyter script you just ran. I’ve also manually added a “test.csv” file to test the load:




By creating a new notebook, you can see the libraries’ location (inside the container), see that the notebook is indeed working from /tmp/dockerwd and read files from your working directory (due to the mapping made) :





Now, since we started the container in detached mode (-d), we can connect to it using the exec command. By doing so, we can navigate to the /tmp/dockerwd folder, see that our files are there and even create a new file, which will of course be displayed on the Jupyter URL:

docker exec -it KagglePythonContainer bash





We can also see that we have two different processes running (ps -ef):

FYI: you will need to run

apt-get update && apt-get install procps 

to be able to run the ps command





At last, as mentioned before, the script will never start more than one container. So if for some reason your container stops (if your VM is restarted, for example), the script will just fire up the container named "KagglePythonContainer", or it won't do anything if the container is already running. In both cases Jupyter's URL will always be displayed:




In order to understand Docker, I highly recommend these tutorials:

Posted in Data Science, Docker, I.T., Python

Dynamic Parameters on Tableau

Posted by Diego on October 29, 2015


This is my solution to deal with Dynamic Parameters on Tableau.

Just to recap: the problem is that, once a parameter list is populated on Tableau Desktop and the workbook is deployed to the server, it is not possible to update the list when new values are required.


I should start by saying that this is not a simple solution.

I mean, the code actually is, but unfortunately there is more to it, mainly because it requires you to republish the dashboard using tabcmd (or any other means) each time you want to change the parameter.

But that's what I've got until Tableau decides to support this much-requested functionality.



My idea is to use Python to access the workbook's XML, navigate the tree to the desired parameter (in my example called "MyParameter"), clear its values (except for the value "None" – this is my particular use case and can be removed very easily) and then inject the desired new values based on a list.



This is a before/after example – as expected, the list has been updated:







And here is the core code:

import ConfigParser
from pkg_resources import resource_filename
from xml.etree import ElementTree

def ParameterUpdate(envname, dashboardname):
    #the lines below are optional. They can be used if the source/destination dashboards are expected to be on a
    #different path than the one the python file is executing from, which is often the case, but not for this example:
    configfile = resource_filename(__name__, 'tableau.ini')
    config = ConfigParser.ConfigParser()
    config.read(configfile)
    inputdir = config.get(envname, "InputDirectory")
    outputdir = config.get(envname, "OutputDirectory")

    #reads the twb file
    tableaufile = ElementTree.parse('{0}/{1}.twb'.format(inputdir, dashboardname))

    #gets a list with the new parameters
    new_parameter_list = __getNewParameterlist()

    #gets the root XML
    fileroot = tableaufile.getroot()

    #finds the parameter called "MyParameter"
    xmlsection = fileroot.find("datasources/datasource/column[@caption='{0}']/members".format("MyParameter"))

    #injects the new list into the XML file
    __injectParameters(xmlsection, new_parameter_list)

    newxmlfile = ElementTree.tostring(fileroot, encoding='UTF-8', method='xml')

    #writes the modified workbook out
    filelocation = '{0}/{1}.twb'.format(outputdir, "Output")
    with open(filelocation, 'w') as text_file:
        text_file.write(newxmlfile)




Of course, this solution requires a whole messaging system between whatever process triggers the parameter update, the Python code and the tabcmd call that deploys the dashboard, which I won't be covering here because it is out of this post's scope.


The complete code and a working example can be downloaded here.




Due to the nature of ElementTree, this code only works on Python 2.7 (not sure about later versions – I'd imagine it does; pretty sure it doesn't work on 2.6).



I have recently been shown, by one of the product managers at Tableau, this other workaround using JavaScript on the server:

Which I'd honestly consider had I not spent time writing my own solution.

Posted in Python, Tableau

The Poisson distribution

Posted by Diego on September 24, 2015


The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.


Example: let's say that you are interested in knowing the probability that X cars are going to pass in front of your house in a given period of time.
To do that, you need to assume two things:


1) Any period of time is no different from any other period of time;

2) The number of cars that pass in one period doesn't affect the number of cars that will pass in the next period.



Given those assumptions, you need to find out how many cars on average pass in front of your house (that is your lambda – λ).

If you want to know the probability that k cars will pass, all you have to do is substitute λ and k in the formula below:

P(X = k) = (λ^k * e^(−λ)) / k!

For example, let's say that the average is 9 cars and you want to know the probability that exactly 2 cars will pass:

λ = 9, k = 2

9^2 / 2! * e^(−9) = 40.5 * 0.0001234 ≈ 0.004998097

Or 0.5%
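The same arithmetic can be sketched in Python with the standard library only, as a cross-check of the values above:

```python
import math

def poisson_pmf(k, lam):
    # P(X = k) = lam^k * e^(-lam) / k!
    return lam ** k * math.exp(-lam) / math.factorial(k)

# P(exactly 2 cars) with an average of 9:
print(round(poisson_pmf(2, 9), 9))  # 0.004998097

# P(0, 1, 2 or 3 cars) - the cumulative probability:
print(round(sum(poisson_pmf(k, 9) for k in range(4)), 8))  # 0.02122649
```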



That same calculation can be achieved in R using the dpois density function:

dpois(2, lambda=9)
[1] 0.004998097




To compute cumulative probabilities (for example, the probability of 0, 1, 2 or 3 cars), we can use the ppois function (which is the same as summing the individual probabilities with dpois):

ppois(3, lambda=9)
[1] 0.02122649



Probability of having seventeen or more cars passing by your house:

ppois(16, lambda=9, lower=FALSE) #upper tail 
[1] 0.01110591


1-ppois(16, lambda=9) #complement of the lower tail
[1] 0.01110591




To finalise, we can simulate how many cars will pass in each of the next 10 hours:

rpois(10, lambda=9)
[1]  9  9 11 15  7  8  8 13 11  9

Posted in R, Statistics

Principal Component Analysis

Posted by Diego on August 18, 2015


In this post I'll be using R to run Principal Component Analysis on the same dataset used on this webcast, which is part of the Udacity Data Analyst Nanodegree. The process I'll follow is pretty much the same as the one on the webcast, but I hope to add some extra information to it. I would also like to thank Udacity, Sheng Kung and Nikolay for the amazing webcast and especially for the detailed answers to my questions on the forums.


1)      What is PCA and what is it good for?


- A very efficient and automatic way of combining features in order to simplify a very large and complicated dataset;

- If we have 2 features, we want to make sure that they are as independent as possible from each other; that each one contains the "purest" kind of information on its own;

- It does that by trying to identify correlation between features, considering a new feature that is some sort of hybrid of others, to make the model simpler and easier to fit to the data;

- One downside is that principal components are vectors over the initial features, so they can be harder to interpret than the initial features;

- One of the assumptions of PCA is that the data is continuous and roughly follows a normal distribution; there are alternative techniques that handle other types of data.


2)      Data

As I said, I'll be using the same dataset used on the webcast, but to make the process easier I skipped the data cleaning and am loading a "ready-to-go" dataset. You can find details about the dataset and the cleaning process on this IPython notebook (I downloaded it from box 15 of the notebook).

The dataset is pretty simple: it contains information about 105 countries around the globe, whose numeric columns will all be considered as features for the sake of the PCA example.


df <- read.csv ("cleaned.csv")

3)      Running PCA:

PCA can be run using the prcomp command. The command offers the possibility of normalizing the data using the "center" and "scale" parameters (more info about normalizing can be found here). The reason you should always normalize the data: imagine that you are trying to run PCA on 2 features A and B that range from, let's say, 1 to 100 for A and 0 to 10 for B. If you do that without normalizing the data, PCA is going to find much larger variance along the A variable, so it is going to follow it rather than trying to find any correlation between the features.

Here I'm calling the prcomp command with center and scale = TRUE (if you omit these parameters, you'll see that the outputted values (weights) are quite different):


prcomp(df[,2:6],center = TRUE,scale = TRUE)
Standard deviations:
[1] 1.8207722 0.9510142 0.7172067 0.4642723 0.2245580

                         PC1         PC2        PC3         PC4          PC5
completion_male   -0.5028374 -0.01732469  0.5155280  0.05640884 -0.691305540
completion_female -0.5088654 -0.01884488  0.4716004 -0.06446605  0.717034253
income_per_person -0.4686439  0.23200303 -0.4755441 -0.70319643 -0.076942116
employment         0.2264034  0.93671022  0.2659600 -0.02261103  0.008335095
life_expectancy   -0.4661813  0.26094689 -0.4636238  0.70545424  0.044373047

We'll be looking at these values in a moment but, to exemplify what the normalization is doing, I will manually normalise the initial dataset by subtracting the feature's mean from each of its values and dividing by its standard deviation:


  df$completion_male <- ((df$completion_male - mean(df$completion_male)) / sd(df$completion_male))
  df$completion_female <- ((df$completion_female - mean(df$completion_female)) / sd(df$completion_female))
  df$income_per_person <- ((df$income_per_person - mean(df$income_per_person)) / sd(df$income_per_person))
  df$employment <- ((df$employment - mean(df$employment)) / sd(df$employment))
  df$life_expectancy <- ((df$life_expectancy - mean(df$life_expectancy)) / sd(df$life_expectancy))
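The same z-score normalisation, sketched in Python with the standard library only (the column values are made up; the point is just the mean/sd arithmetic):

```python
from statistics import mean, stdev

def normalise(values):
    # z-score: subtract the column mean, divide by the sample standard
    # deviation - the same thing R's (x - mean(x)) / sd(x) does
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

employment = [52.5, 60.0, 71.1, 57.9, 83.2]  # hypothetical column
z = normalise(employment)
print(mean(z), stdev(z))  # ~0.0 and ~1.0 after normalisation
```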

Here's how the normalized data looks:

Now, if I run PCA with the center and scale parameters set to FALSE on the already-normalized data, you can see that I get the same result as with the previous command.


pca <- prcomp(df[,2:6],center = FALSE,scale = FALSE) 

Standard deviations:
[1] 1.8207722 0.9510142 0.7172067 0.4642723 0.2245580

                         PC1         PC2        PC3         PC4          PC5
completion_male   -0.5028374 -0.01732469  0.5155280  0.05640884 -0.691305540
completion_female -0.5088654 -0.01884488  0.4716004 -0.06446605  0.717034253
income_per_person -0.4686439  0.23200303 -0.4755441 -0.70319643 -0.076942116
employment         0.2264034  0.93671022  0.2659600 -0.02261103  0.008335095
life_expectancy   -0.4661813  0.26094689 -0.4636238  0.70545424  0.044373047

4)      Principal Components:

Each component corresponds to the coefficients (weights) to be applied to each feature (the data points) in order to get to the new resulting point. PCA can come up with a number of orthogonal components equal to the number of features we fit, so if we have 5 features in the model, the maximum number of principal components is 5. By running summary on the pca object, we can see that the first principal component explains 66% of the data's variance, the second 18%, and so on:


Importance of components:
                         PC1    PC2    PC3     PC4     PC5
Standard deviation     1.821 0.9510 0.7172 0.46427 0.22456
Proportion of Variance 0.663 0.1809 0.1029 0.04311 0.01009
Cumulative Proportion  0.663 0.8439 0.9468 0.98991 1.00000


The prcomp function offers a "tol" parameter that can be used to limit the number of components returned. The value indicates the magnitude below which components should be omitted (components are omitted if their standard deviations are less than or equal to "tol" times the standard deviation of the first component). For example, had we called the function above with tol = 0.4, we'd have gotten only 2 principal components, because 1.821 * 0.4 = 0.7283 and PC3's standard deviation is 0.7172.
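The filter can be reproduced on the standard deviations printed above; a small Python sketch of prcomp's "tol" rule:

```python
# Standard deviations of the five components, from the prcomp output above.
sds = [1.8207722, 0.9510142, 0.7172067, 0.4642723, 0.2245580]

tol = 0.4
# Keep components whose sd exceeds tol times the sd of the first component.
kept = [sd for sd in sds if sd > tol * sds[0]]
print(len(kept))  # 2 -> only PC1 and PC2 survive
```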

The principal components can be found by running pca[2] which will output a matrix containing the features as rows and the principal components as columns:


Note that all 5 principal components are generated from all five features (principal components are linear combinations over all features). If we had requested prcomp to generate a different number of PCs, their order would be the same, since the data isn't actually changing. For example, PC1 will always be (-0.5028374, -0.5088654 and so on…) since those vectors are still the directions of maximal variance on the five features we are investigating.

Even though each principal component is generated from all five features, one thing that you might find from your principal component transformation is that there may be certain features that are highly correlated with one another and you might see a principal component that has a very large weight on a handful of features.

Features with similar coefficients in the same direction (both positive or both negative) are obviously similar, but features of the same magnitude but opposite directions are also interesting for their negative correlations. In terms of overall magnitude, coefficients do not need to all be large, only similar.

The overall length of the principal component vector is 1 (the sum of each weight squared), so if we observe a coefficient that has a large magnitude, then we know that most of the weight of that principal component is on that single feature and the rest of the weights will be relatively small. You’ll tend to see correlated features have smaller coefficient values, due to the need to ‘share’ weight on each of the features.
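This unit-length property can be checked directly on the PC1 weights printed by prcomp above:

```python
# PC1 weights from the prcomp output; their squared sum should be 1.
pc1 = [-0.5028374, -0.5088654, -0.4686439, 0.2264034, -0.4661813]
length_sq = sum(w * w for w in pc1)
print(round(length_sq, 6))  # 1.0
```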

In this case, we can see that the second principal component represents the variation contributed by the “employment” feature.


Another interesting element of the pca object is x (the fifth element on the pca list), which contains the transformed data (the original data with the PCA weights applied): pca$x[1:5,]

So, for example, if we take the first row of our dataset, Argentina (remember that it contains the normalized values, because I normalized it manually), and do the matrix multiplication between its values and the first principal component, we'll get cell pca$x[1,1]:

0.500062073  * -0.5028374 +

0.685201535  * -0.5088654 +

0.30719635   * -0.4686439 +

0.016051348  *  0.2264034 +

0.63137587   * -0.4661813 = -1.0348
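The dot product can be checked in Python using Argentina's normalised values and the PC1 weights exactly as printed by prcomp (keeping their signs):

```python
# Argentina's normalised feature values (first row of the data):
row = [0.500062073, 0.685201535, 0.30719635, 0.016051348, 0.63137587]
# PC1 weights from the prcomp output above:
pc1 = [-0.5028374, -0.5088654, -0.4686439, 0.2264034, -0.4661813]

# The transformed value pca$x[1, 1] is just the dot product:
score = sum(x * w for x, w in zip(row, pc1))
print(round(score, 4))  # -1.0348
```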



5)      PCA’s Properties:

The resulting components from PCA are independent of each other: the information contained in one doesn't tell us anything about the values of the others. So when we fit them to a model, each of these features stands on its own, telling us something different about the relationship between our predictors and our outcome. Also, the components are orthogonal to each other and normalized to have length 1:

components <- pca[2]
#Components are orthogonal to each other (dot product = 0)
components$rotation[,3] %*% components$rotation[,1]
[1,] 1.387779e-16
#Components are normalized to have length 1
components$rotation[,1] %*% components$rotation[,1]
[1] 1



6)      Plotting:

If we plot this in the principal component space (where each dot represents a country plotted with its 2 PCAs – let’s use PC1 and PC2), with an arrow pointing from the origin to the point implied by the row’s coordinates, we can see how well each feature aligns with each principal component. If we see an arrow aligned strongly with one of the axes, then we can say that the principal component on that axis may represent the feature fairly well. (For example, the fourth row has a very strong alignment with the second principal component.)

We can also see, if multiple vectors point in the same direction, how much similarity there is between features. (This can be seen in the first and second rows of the array, as well as the third and fifth rows.)

#pca1/pca2 hold the feature weights on the first two components;
#arrow_size is just a scaling factor for visibility
pca1 <- pca$rotation[,1]
pca2 <- pca$rotation[,2]
arrow_size <- 2

plot(pca$x[,1],  pca$x[,2])
arrows(0,0,pca1[1]* arrow_size,pca2[1]* arrow_size) 
arrows(0,0,pca1[2]* arrow_size,pca2[2]* arrow_size)
arrows(0,0,pca1[3]* arrow_size,pca2[3]* arrow_size)
arrows(0,0,pca1[4]* arrow_size,pca2[4]* arrow_size)
text(pca1[4]* arrow_size,pca2[4]* arrow_size, row.names(pca$rotation)[4])
arrows(0,0,pca1[5]* arrow_size,pca2[5]* arrow_size)

7)    Extra points on Interpreting the results:

The explained variance on each principal component can be very useful on the interpretation. Since there is a good amount of variance explained by both the first and second principal component, we can take a look at both of these components together when trying to see if there are features that go well with one another. If the first component had a higher amount of explained variance and the second component a smaller amount of explained variance, then we might well believe that there is only one major signal in the data that is covered by all of the components except for the fourth. With the data as it is, the second component is important enough that we might be better-suited to consider the first and second features as one pair of related features, and the third and fifth as a second pair.

If we see this behaviour in our principal components, then we can actually go back and create hybrid features and then substitute them in for our original features in the original feature space. When it comes to creating composite features based on the results of PCA, it certainly helps to have domain knowledge. And if you don’t have that knowledge, it is a good opportunity to do some additional exploration and research to see if you can understand the underlying relationship between features. If you have documentation for your dataset, look there; performing additional visualizations can also be a big help to understanding the data. Data analysis is a process where you do need to move between steps in an iterative fashion, from wrangling to exploration to modelling.

As for selecting a good number of principal components to reduce down to, you can look at the amount of variance explained by each component. If you start out with a large number of components, then you can see the importance of each component on explaining variation in the original data. Once the values get very small, you can make an assumption that the components are beginning to fit to the random noise in the data rather than the true signal. If you look at the size of the explained variance values or the trend in the cumulative sum of explained variance coefficients (visualizing these in plots will be quite useful), this can help you make a judgement on how many components you want to include in your algorithm.

Interpretation of the output of principal component analysis is quite tricky since there are a lot of moving parts and principal components may not always be easily interpretable. I don’t think there’s any hard, deterministic rules to follow; one of the most important skills that you can have is to be able to synthesize multiple bits of information together and make your own conclusions.

However, it's also important to realize that, in practice, it may not be possible to see such clean correlations and relationships between features. If we have a lot of features, it may be difficult to find a coherent interpretation for each principal component. This is one of the downsides of PCA: it may be a good dimensionality reduction technique, but it certainly has the possibility of returning uninterpretable components.

Posted in PCA, Statistics