A blog about nothing

A collection of useless things that may turn out to be useful

How to move a VM file to another location

Posted by Diego on November 25, 2016

1) Copy the "VirtualBox VMs" folder from its current location to the new location you desire;

2) Change the "Default Machine Folder" to the new location (go to File -> Preferences -> General):

clip_image001

 

3) In the VirtualBox Manager, right-click your VM and click "Remove" -> “Remove only”.

clip_image002

4) Close and then reopen the VirtualBox Manager

5) Go to “Machine” -> “Add” (it should default to the new folder) and re-add the VM
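For reference, the same move can also be scripted with VBoxManage (a minimal sketch, assuming a VM named "MyVM" – a hypothetical name – and that the folder has already been copied to the new location):

VBoxManage list vms                                                   # confirm the VM's registered name
VBoxManage unregistervm "MyVM"                                        # same as "Remove only" (files are kept)
VBoxManage setproperty machinefolder "/new/location/VirtualBox VMs"   # new Default Machine Folder
VBoxManage registervm "/new/location/VirtualBox VMs/MyVM/MyVM.vbox"   # re-add the copied VM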

Posted in Uncategorized | Leave a Comment »

How to connect to PostgreSQL running in an Ubuntu VM

Posted by Diego on September 19, 2016

 

FYI, PostgreSQL can be installed using:

 

                                

sudo apt-get update
sudo apt-get install postgresql postgresql-contrib

 

 

Optionally, install pgAdmin III (https://www.pgadmin.org/download/windows.php) to test the connectivity.

 

VirtualBox creates virtual machines with the NAT network type by default. If you want to run server software inside a virtual machine, you’ll need to change its network type or forward ports through the virtual NAT.

With the NAT network type, your host operating system performs network address translation. The virtual machine shares your host computer’s IP address and won’t receive any incoming traffic. You can use bridged networking mode instead — in bridged mode, the virtual machine will appear as a separate device on your network and have its own IP address.

To change a virtual machine’s network type in VirtualBox, right-click the virtual machine and select Settings, go to “Network” and change the “Attached to” option to “Bridged Adapter”.
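If you prefer the command line, the same change can be sketched with VBoxManage (the VM name "UbuntuVM" and the host interface name eth0 are assumptions; run it with the VM powered off):

VBoxManage modifyvm "UbuntuVM" --nic1 bridged --bridgeadapter1 eth0    # switch adapter 1 to bridged mode

# or keep NAT and just forward PostgreSQL's port to the host:
VBoxManage modifyvm "UbuntuVM" --natpf1 "pgsql,tcp,,5432,,5432"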

 

clip_image001[4]

 

You can check your VM’s new IP by typing “ifconfig” or clicking on the top right corner icon -> “System Settings” -> “Network”:

 

clip_image003[4]

 

 

Then, navigate to Postgres’ configuration folder (normally /etc/postgresql/9.5/main) and edit the postgresql.conf file, setting the listen_addresses parameter to whatever suits you (I set it to all):

 

                           

sudo vi postgresql.conf

clip_image005[4]

                      

sudo systemctl restart postgresql  #restart

 

That will make PostgreSQL listen on all IPs, as we can see in this before/after:

clip_image007[4]

 

(That will allow you to connect locally using the server’s IP address; before this step you’d only be able to connect using “localhost”.)
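For reference, the whole before/after boils down to one line in postgresql.conf (a sketch; '*' means listen on every interface):

#listen_addresses = 'localhost'    # before: local connections only (the default)
listen_addresses = '*'             # after: listen on all IPs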

Lastly, edit the pg_hba.conf (https://www.postgresql.org/docs/9.5/static/auth-pg-hba-conf.html) file, which controls client authentication, and add a row that allows all connections:

 

                   

sudo vi pg_hba.conf

clip_image009[4]

 

                   

sudo systemctl restart postgresql
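The added row looks something like the sketch below (deliberately permissive; tighten the address range and auth method to whatever suits your environment):

host    all    all    0.0.0.0/0    md5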

 

 

By doing so, you should be able to access PostgreSQL from outside your VM.

Posted in PSQL | Leave a Comment »

TensorFlow working on an Ubuntu virtual Machine

Posted by Diego on August 5, 2016

 

This is a quick step-by-step on how to get TensorFlow working on an Ubuntu virtual machine. The initial steps are very high level because it is not hard to find tutorials or documentation that support them.

1) Download and install Oracle Virtual Box from https://www.virtualbox.org/wiki/Downloads

 

2) Download an Ubuntu image from http://www.ubuntu.com/download

3) Create a new VM (make sure to always run “as administrator” – you may get errors otherwise):

image

4) Once the VM is up and running, install software to help development (CTRL + ALT + T is the shortcut to open a new terminal):

 

 

Guest Additions:

·   sudo apt-get install virtualbox-guest-utils virtualbox-guest-x11 virtualbox-guest-dkms


Sublimetext:

·         Download from: sublimetext.com

·         Run: sudo dpkg -i /home/diego/Downloads/<DownloadedFile>

·         sudo apt-get install -f


htop:

·         sudo apt-get install htop

Anaconda (Jupyter notebooks – optional if you want to run the Udacity examples):

·         https://www.continuum.io/downloads

·         bash Anaconda2-4.1.1-Linux-x86_64.sh

·         Restart the VM or type: “. .bashrc” (two dots, without the quotes) on your home directory to update the PATH

·         Type jupyter notebook to start it

TensorFlow:

If you did not install Anaconda (you are probably using /usr/bin/python):

 

·         Run: sudo apt-get install python-pip python-dev

·         # For Ubuntu/Linux 64-bit, CPU only, Python 2.7

·         $ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp27-none-linux_x86_64.whl

·         # Python 2:  $ sudo pip install --upgrade $TF_BINARY_URL

 

If you installed Anaconda (you are probably using /home/<user>/anaconda/….):

·         # Python 2.7:  conda create -n tensorflow python=2.7

·         conda install -c conda-forge tensorflow
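Either way, a quick sanity check (a minimal sketch using the classic TensorFlow 0.x session API) confirms the install worked:

python -c "import tensorflow as tf; sess = tf.Session(); print(sess.run(tf.constant('Hello, TensorFlow')))"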

 

 

image

 

 

 

Mounting a shared folder (network drive):

 

* Mount:

Go to your VM’s settings and add a shared folder:

image

 

On Unix run:

sudo mount -t vboxsf TensorflowVMShared /home/diego/Documents/myMountedFolder

 

where:

TensorflowVMShared is the Windows alias you created and

/home/diego/Documents/myMountedFolder is the folder on Unix

 

 

* See a list of mounts:

cat /proc/mounts

OR

df -aTh

 

* Remove the mounts:

sudo umount -f /home/diego/Documents/myMountedFolder
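As an aside, if you tick “Auto-mount” in the shared folder settings, VirtualBox mounts the share automatically under /media/sf_<ShareName>; you only need to be in the vboxsf group to access it (a sketch, reusing the share name above):

sudo usermod -aG vboxsf $USER      # log out and back in for the group change to take effect
ls /media/sf_TensorflowVMShared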

 

Pycharm:

Run:

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

Download and unpack Pycharm (move it to the desired folder, like “opt”)

Run pycharm.sh from the bin subdirectory

 

Creating a launcher icon:

http://askubuntu.com/questions/330781/how-to-create-launcher-for-application

https://help.ubuntu.com/community/UnityLaunchersAndDesktopFiles
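Following those guides, the launcher boils down to a small .desktop file (a sketch; the /opt/pycharm paths are assumptions based on where you unpacked it):

cat > ~/.local/share/applications/pycharm.desktop <<EOF
[Desktop Entry]
Type=Application
Name=PyCharm
Exec=/opt/pycharm/bin/pycharm.sh
Icon=/opt/pycharm/bin/pycharm.png
Terminal=false
Categories=Development;IDE;
EOF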

Posted in Data Science, Deep Learning, Machine Learning | Leave a Comment »

How to install PyGame using Anaconda

Posted by Diego on June 14, 2016

 

Search the package using:

binstar search -t conda pygame

 

package

 

 

Install the package for the desired platform:

 

conda install -c https://conda.binstar.org/prkrekel pygame

 

 

package2
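To confirm the package is importable, a quick check (pygame.ver prints the installed version):

python -c "import pygame; print(pygame.ver)"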

Posted in Python | Leave a Comment »

Docker Containers on Windows

Posted by Diego on February 23, 2016

This post is a follow-up on Kaggle’s “How to get started with data science in containers” post.
I was having some trouble setting up the environment on Windows, so I decided to document my steps and discoveries.
I also like to think that I improved the provided script a little by preventing multiple containers from being created.

 

As the post above indicates, the first thing you have to do is to install Docker: https://docs.docker.com/windows/

Because the Docker Engine daemon uses Linux-specific kernel features, you can’t run Docker Engine natively in Windows. Instead, you must use the Docker Machine command, docker-machine, to create and attach to a small Linux VM on your machine.

The installer is really straightforward, and after it completes it leaves a link to the Quickstart Terminal on your desktop. When you run it, the “default” machine will be started, but as mentioned in Kaggle’s post, “it’s quite small and struggles to handle a typical data science stack”, so the post walks you through the process of creating a new VM called “docker2”. If you followed the steps, you will be left with two VMs:

 

image
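For reference, the docker2 machine from Kaggle’s post is created with something along these lines (the disk and memory sizes here are assumptions – check the original post for the exact values they suggest):

docker-machine create -d virtualbox --virtualbox-disk-size "50000" --virtualbox-memory "8192" docker2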

 

To avoid having to run

docker-machine start docker2
eval $(docker-machine env docker2)

to change the VM every time you launch your CLI, you can edit the start.sh script it calls and change the VM’s name so it will start the docker2 VM instead.

 

image                          image
     

 

image

 


When you log in you should be in your user folder “c:\users\<your_user_name>”.

You can see that I already have Kaggle’s python image ready to be used and that I have no containers running at the moment:

 

image

 

 

To make things simpler (and to avoid seeing all the files we normally have in our user folder), I created and will be working inside the _dockerjupyter folder.

Inside that folder, I created a script called StartJupyter, which is based on the one found in the tutorial, but with a few modifications.

There is of course room for improvement on this script – like making the container name, the port and maybe the VM name (in case you chose anything other than “docker2”) parameterized.

 

CONTAINER=KagglePythonContainer

RUNNING=$(docker inspect -f '{{.State.Running}}' KagglePythonContainer 2> /dev/null) #discard stderr: inspect fails if KagglePythonContainer doesn't exist yet (first run)

if [ "$RUNNING" == "" ]; then  
	echo "$CONTAINER does not exist - Creating"
	(sleep 3 && start "http://$(docker-machine ip docker2):8888")&	  
	docker run -d -v $PWD:/tmp/dockerwd -w=/tmp/dockerwd -p 8888:8888 --name KagglePythonContainer -it kaggle/python jupyter notebook --no-browser --ip="0.0.0.0" --notebook-dir=/tmp/dockerwd
	exit
fi


if [ "$RUNNING" == "true" ]; then  
	echo "Container already running"
else
	echo "Starting Container" 	
	docker start KagglePythonContainer	
fi 

#Launch URL
start "http://$(docker-machine ip docker2):8888"  

 

The two main modifications are the “-d” flag that runs the container in the background (detached mode) and the “--name” that I use to control the container’s state and existence. This is important to avoid creating more than one container, which would cause a conflict (due to the port assignment) and leave it stopped. This is what would happen:

image

 

The ”-v $PWD:/tmp/dockerwd” will map your current working directory to “/tmp/dockerwd” inside the container, so Jupyter’s initial page will show the content of the folder you are in.

Running the code will create and start the container in detached mode:

 

image

 

It will also open the browser on your current working directory where you’ll probably only see the StartJupyter script you just ran. I’ve also manually added a “test.csv” file to test the load:

 

image

 

By creating a new notebook, you can see the libraries’ location (inside the container), see that the notebook is indeed working from /tmp/dockerwd and read files from your working directory (due to the mapping made) :

 

image

 

 

Now, since we started the container in detached mode (-d), we can connect to it using the exec command and, by doing so, navigate to the /tmp/dockerwd folder, see that our files are there and even create a new file, which will of course be displayed on the Jupyter URL:

docker exec -it KagglePythonContainer bash

 

image

 

 

We can also see that we have 2 different processes running (ps -ef):

FYI: you will need to run

apt-get update && apt-get install procps 

to be able to run the ps command

 

image

 

 

Lastly, as mentioned before, the script will never start more than one container. So, if for some reason you stop your container (if your VM is restarted, for example), the script will just fire up the container named “KagglePythonContainer”, or it won’t do anything if you call the script with the container already running. In both cases Jupyter’s URL will always be displayed:

image

 

 

In order to understand docker, I highly recommend these tutorials:

https://training.docker.com/self-paced-training

Posted in Data Science, Docker, I.T., Python | Leave a Comment »

Dynamic Parameters on Tableau

Posted by Diego on October 29, 2015

 

This is my solution to deal with Dynamic Parameters on Tableau.

Just to recap, the problem is related to the fact that, once a parameter list is populated on Tableau Desktop and the workbook is deployed to the server, it is not possible to update the list if new values are required.

 

I should start by saying that this is not a simple solution. 

I mean, the code actually is simple, but unfortunately there is more to it, mainly because it requires you to republish the dashboard using tabcmd (or any other means) each time you want to change the parameter.

But that’s what I’ve got until Tableau decides to support this much-requested functionality.

 

 

My idea is to use Python to access the workbook’s XML, navigate the tree to the desired parameter (in my example called “MyParameter”), clear its values (except for the value “None” – this is my particular use case and can be removed very easily) and then inject the desired new values based on a list.

 

 

This is a before\after example – as expected, the list has been updated:

 

clip_image001clip_image003

 

 

 

 

And here is the core code:

#imports needed by this snippet (Python 2.7)
from pkg_resources import resource_filename
from xml.etree import ElementTree
import ConfigParser

def ParameterUpdate(envname, dashboardname):
    #the 5 lines bellow are optional. It can be used if the Source\Destination Dashboards are expected to be on a
    #different path than the one the python file is executing, which is often the case, but not for this example:
    configfile = resource_filename(__name__, 'tableau.ini')
    config = ConfigParser.ConfigParser()
    config.read(configfile)
    inputdir = config.get(envname, "InputDirectory")
    outputdir = config.get(envname, "OutputDirectory")

    #reads the twb file
    tableaufile = ElementTree.parse('{0}/{1}.twb'.format(inputdir, dashboardname))

    #gets a list with the new parameters
    new_parameter_list = __getNewParameterlist()

    #gets the root XML
    fileroot = tableaufile.getroot()

    #finds the parameter called "MyParameter"
    xmlsection = fileroot.find("datasources/datasource/column[@caption='{0}']/members".format("MyParameter"))

    #Inject the new list into the XML file
    __injectParameters(xmlsection, new_parameter_list)

    newxmlfile = ElementTree.tostring(fileroot, encoding='UTF-8', method='xml')

    filelocation = '{0}/{1}.twb'.format(outputdir, "Output")
    with open(filelocation, 'w') as text_file:
         text_file.write(newxmlfile)

 

 

 

Of course, this solution requires a whole messaging system between whatever process triggers the parameter update, the Python code and the tabcmd call that deploys the dashboard, which I won’t be covering here because it is out of this post’s scope.
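For completeness, that redeploy step looks roughly like the sketch below (server URL, credentials, project and workbook names are all placeholders):

tabcmd login -s https://your-tableau-server -u your_user -p your_password
tabcmd publish "Output.twb" -n "MyDashboard" --project "MyProject" --overwrite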

 

The complete code and a working example can be downloaded here.

 

 

OBS:

Due to the nature of ElementTree, this code only works on Python 2.7 (not sure about later versions – I’d imagine it does work; pretty sure it doesn’t work on 2.6).

 

OBS2:

I have recently been shown, by one of the product managers at Tableau, this other workaround using JavaScript on the server:

https://www.youtube.com/watch?v=6GlNxEN1Guw

I’d honestly consider it had I not spent time writing my own solution.

Posted in Python, Tableau | Leave a Comment »

The Poisson distribution

Posted by Diego on September 24, 2015

 

The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.

 

Example: let’s say that you are interested in knowing the probability that X cars are going to pass in front of your house in a given period of time.
To do that, you need to assume two things:

 

1)      Any period of time is no different than any other period of time;

2)      The number of cars that pass in one period doesn’t affect the number of cars that will pass in the next period;

 

 

Given those assumptions, you need to find out how many cars, on average, pass in front of your house (that is your lambda – λ).

If you want to know the probability that k cars will pass, all you have to do is substitute λ and k in the formula below:

 

P(X = k) = λ^k · e^(−λ) / k!

 

 

For example, let’s say that the average is 9 cars and you want to know the probability that exactly 2 cars will pass:

λ =9

k=2

 

9^2 / 2! * e^(−9)  =  81 / 2 * 0.0001234  =  0.004998097

 

Or 0.5%

 

 

That same calculation can be achieved in R using the dpois density function:

 

dpois(2,9)

[1] 0.004998097

 

 

 

To compute cumulative probabilities (for example, probability of 0 or 1 or 2 or 3 cars ), we can use the ppois function (which is the same as summing all individual probabilities with dpois):

 
ppois(3,9)
[1] 0.02122649

 

sum(dpois(0:3,9))
[1] 0.02122649
 
 

Probability of having seventeen or more cars passing by your house:

ppois(16, lambda=9, lower=FALSE) #upper tail 
[1] 0.01110591
 

OR

 
1-ppois(16, lambda=9) #lower tail
[1] 0.01110591

 

 

 

To finish, we can simulate how many cars will pass in each of the next 10 periods:

rpois(10,9)
[1]  9  9 11 15  7  8  8 13 11  9

Posted in R, Statistics | Leave a Comment »

Principal Component Analysis

Posted by Diego on August 18, 2015

 

In this post I’ll be using R to run Principal Component Analysis on the same dataset used in this webcast, which is part of the Udacity Data Analyst Nanodegree. The process I’ll follow is pretty much the same as the one in the webcast, but I hope to add some extra information to it. I would also like to thank Udacity, Sheng Kung and Nikolay for the amazing webcast and especially for the detailed answers to my questions on the forums.

 

1)      What is PCA and what is it good for?

 

·         Very efficient and automatic way of combining features in order to simplify a very large and complicated dataset;

·         If we have 2 features, we want to make sure that they are as independent as possible from each other; that each one contains the “purest” kind of information on their own;

·         It does that by trying to identify correlation between features by considering a new feature that is some sort of hybrid between others, to make the model simpler and easier to fit on the data;

·         One downside is that principal components are vectors over the initial features, so they can be harder to interpret than the original features;

·         One of the assumptions on PCA is that the data is continuous and that it follows roughly a normal distribution. There are alternate techniques that handle other types of data;

 

2)      Data

As I said, I’ll be using the same dataset used in the webcast, but to make the process easier I skipped the data cleaning process and am loading a “ready-to-go” dataset. You can find details about the dataset and the cleaning process in this IPython notebook (I downloaded it from box 15 in the notebook).

The dataset is pretty simple: it contains information about 105 countries around the globe, and all of its columns will be treated as features for the sake of the PCA example.

 

df <- read.csv ("cleaned.csv")
head(df)

 

Country      completion_male  completion_female  income_per_person  employment  life_expectancy
Argentina           97.91143          102.10666           9.465967        58.4             75.5
Armenia             91.53127           94.58897           8.749288        39.4             72.2
Austria            100.63802          101.12819          10.656714        57.1             80
Azerbaijan          91.68856           90.66259           9.264091        59.3             69.9
Bahamas             99.76959           96.84352          10.133238        66.9             72.1

 

 

3)      Running PCA:

PCA can be run using the prcomp command. The command offers the possibility of normalizing the data using the “center” and “scale” parameters (more info about normalizing can be found here). The reason you should always normalize the data is this: imagine that you are trying to run PCA on two features A and B, where A ranges from, let’s say, 1 to 100 and B from 0 to 10. If you do that without normalizing the data, PCA is going to find a large principal component along the A variable, so it is going to follow it rather than trying to find any correlation between the features.

Here I’m calling the prcomp command with center and scale = TRUE (If you omit these parameters you’ll see that the values outputted (weights) will be quite different):

 

prcomp(df[,2:6],center = TRUE,scale = TRUE)
Standard deviations:
[1] 1.8207722 0.9510142 0.7172067 0.4642723 0.2245580

Rotation:
                         PC1         PC2        PC3         PC4          PC5
completion_male   -0.5028374 -0.01732469  0.5155280  0.05640884 -0.691305540
completion_female -0.5088654 -0.01884488  0.4716004 -0.06446605  0.717034253
income_per_person -0.4686439  0.23200303 -0.4755441 -0.70319643 -0.076942116
employment         0.2264034  0.93671022  0.2659600 -0.02261103  0.008335095
life_expectancy   -0.4661813  0.26094689 -0.4636238  0.70545424  0.044373047

We’ll be looking at these values in a moment, but to exemplify what the normalization is doing, I will manually normalize the initial dataset by subtracting the feature’s mean from each of its values and dividing by its standard deviation:

 

  df$completion_male <- ((df$completion_male - mean(df$completion_male)) / sd(df$completion_male))
  df$completion_female <- ((df$completion_female - mean(df$completion_female)) / sd(df$completion_female))
  df$income_per_person <- ((df$income_per_person - mean(df$income_per_person)) / sd(df$income_per_person))
  df$employment <- ((df$employment - mean(df$employment)) / sd(df$employment))
  df$life_expectancy <- ((df$life_expectancy - mean(df$life_expectancy)) / sd(df$life_expectancy))
  head(df)

Here’s what the normalized data looks like:

Country      completion_male  completion_female  income_per_person   employment  life_expectancy
Argentina        0.500062073        0.685201535         0.30719635  0.016051348       0.63137587
Armenia          0.149531869        0.339307131       -0.272843386  -1.86760039       0.26952447
Austria          0.649862727        0.640181416        1.270919761  -0.11283039        1.1248096
Azerbaijan       0.158173486        0.158651521        0.143808985  0.105276729       0.01732501
Bahamas           0.60215061        0.443040622        0.847247396   0.85873765       0.25855927

 

Now, if I run PCA with the center and scale parameters = FALSE on the normalized data, you can see that I get the same result as the previous command.

 

pca <- prcomp(df[,2:6],center = FALSE,scale = FALSE) 
pca


Standard deviations:
[1] 1.8207722 0.9510142 0.7172067 0.4642723 0.2245580

Rotation:
                         PC1         PC2        PC3         PC4          PC5
completion_male   -0.5028374 -0.01732469  0.5155280  0.05640884 -0.691305540
completion_female -0.5088654 -0.01884488  0.4716004 -0.06446605  0.717034253
income_per_person -0.4686439  0.23200303 -0.4755441 -0.70319643 -0.076942116
employment         0.2264034  0.93671022  0.2659600 -0.02261103  0.008335095
life_expectancy   -0.4661813  0.26094689 -0.4636238  0.70545424  0.044373047



4)      Principal Components:

Each component corresponds to the coefficients (weights) to be applied to each feature (the data points) in order to get to the new resulting point. PCA can come up with a number of orthogonal components equal to the number of features we fit, so if we have 5 features in the model, the maximum number of principal components is 5. By running a summary on the pca object, we can see that the first principal component explains 66% of the data’s variance, the second 18%, and so on:

 

summary(pca)
Importance of components:
                         PC1    PC2    PC3     PC4     PC5
Standard deviation     1.821 0.9510 0.7172 0.46427 0.22456
Proportion of Variance 0.663 0.1809 0.1029 0.04311 0.01009
Cumulative Proportion  0.663 0.8439 0.9468 0.98991 1.00000
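As a side note, the proportion of variance is just each component’s variance divided by the total variance (which here equals 5, the number of features, because the data was scaled to unit variance):

$$\mathrm{Proportion\ of\ Variance}_i = \frac{\sigma_i^2}{\sum_j \sigma_j^2}, \qquad \text{e.g. } \frac{1.8208^2}{5} \approx 0.663$$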

 

The prcomp function offers a “tol” parameter that can be used to limit the number of components returned. The value indicates the magnitude below which components should be omitted (components are omitted if their standard deviations are less than or equal to “tol” times the standard deviation of the first component). For example, had we called the function above with tol = 0.4, we’d have gotten only 2 principal components, because 1.821 * 0.4 = 0.7284 and PC3’s standard deviation (0.7172) falls below that threshold.

The principal components can be found by running pca[2] which will output a matrix containing the features as rows and the principal components as columns:

pca[2] #as.data.frame(components$rotation)

 

row.names                 PC1          PC2        PC3          PC4           PC5
completion_male    -0.5028374  -0.01732469   0.515528   0.05640884   -0.69130554
completion_female  -0.5088654  -0.01884488  0.4716004  -0.06446605   0.717034253
income_per_person  -0.4686439   0.23200303 -0.4755441  -0.70319643   -0.07694212
employment          0.2264034   0.93671022    0.26596  -0.02261103   0.008335095
life_expectancy    -0.4661813   0.26094689 -0.4636238   0.70545424   0.044373047

 

Note that all 5 principal components are generated from all five features (principal components are a linear combination over all features). If we had requested that prcomp generate a different number of PCs, their order would be the same, since the data isn’t actually changing. For example, PC1 will always be (-0.5028374, -0.5088654 and so on…) since those vectors should still be the directions of maximal variance on the five features we are investigating.

Even though each principal component is generated from all five features, one thing that you might find from your principal component transformation is that there may be certain features that are highly correlated with one another and you might see a principal component that has a very large weight on a handful of features.

Features with similar coefficients in the same direction (both positive or both negative) are obviously similar, but features of the same magnitude but opposite directions are also interesting for their negative correlations. In terms of overall magnitude, coefficients do not need to all be large, only similar.

The overall length of the principal component vector is 1 (the sum of each weight squared), so if we observe a coefficient that has a large magnitude, then we know that most of the weight of that principal component is on that single feature and the rest of the weights will be relatively small. You’ll tend to see correlated features have smaller coefficient values, due to the need to ‘share’ weight on each of the features.

In this case, we can see that the second principal component represents the variation contributed by the “employment” feature.

 

Another interesting element of the pca object is x (the fifth element of the pca list), which contains the transformed data (the original data with the PCA weights applied):

as.data.frame(pca$x[1:5,])

 

    PC1      PC2      PC3      PC4      PC5
-1.0348   0.2295   0.1464   0.2131   0.1501
-0.6685  -1.7514  -0.2548   0.4108   0.1573
-1.7981   0.4594  -0.5189  -0.1023  -0.0390
-0.2119   0.1308   0.1079  -0.0926  -0.0050
-0.8514   1.0496   0.2250  -0.4274  -0.1452

 

So, for example, if we take the first row of our df dataset, Argentina (remember that it contains the normalized values because I normalized it manually), and do the matrix multiplication between its values and the first principal component, we’ll get cell pca$x[1,1]:

country    completion_male  completion_female  income_per_person  employment  life_expectancy
Argentina        0.5000621          0.6852015          0.3071964   0.0160513        0.6313759
PC1             -0.5028374         -0.5088654         -0.4686439   0.2264034       -0.4661813

0.500062073 * (-0.5028374) +

0.685201535 * (-0.5088654) +

0.30719635  * (-0.4686439) +

0.016051348 *   0.2264034  +

0.63137587  * (-0.4661813) = -1.0348
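In general, the transformed data is simply the normalized data matrix multiplied by the rotation (weight) matrix:

$$\mathrm{pca\$x} = \tilde{X} W, \qquad z_{ij} = \sum_{k} \tilde{x}_{ik}\, w_{kj}$$

where $\tilde{X}$ is the centered and scaled data and $W$ is the matrix of principal component weights (pca$rotation).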

 

 

5)      PCA’s Properties:

The resulting components from PCA are independent from each other: the information contained in one doesn’t tell us anything about the values of the others. So when we fit them to a model, each of these features stands on its own, telling us something different about the relationship between our predictors and our outcome. Also, the components are orthogonal to each other and normalized to have length 1:

components <- pca[2]
#Components are orthogonal to each other (dot product = 0)
components$rotation[,3] %*% components$rotation[,1]
             [,1]
[1,] 1.387779e-16
#Components are normalized to have length 1.
sum(components$rotation[,1]**2)
[1] 1

 

 

6)      Plotting:

If we plot this in the principal component space (where each dot represents a country plotted against its first two principal components – PC1 and PC2), with an arrow pointing from the origin to the point implied by each row’s coordinates, we can see how well each feature aligns with each principal component. If we see an arrow aligned strongly with one of the axes, then we can say that the principal component on that axis may represent the feature fairly well. (For example, the fourth row has a very strong alignment with the second principal component.)

We can also see, if multiple vectors point in the same direction, how much similarity there is between features. (This can be seen in the first and second rows of the array, as well as the third and fifth rows.)

pca1<-components$rotation[,1]
pca2<-components$rotation[,2]
 
plot(pca$x[,1],  pca$x[,2])
arrow_size=2
arrows(0,0,pca1[1]* arrow_size,pca2[1]* arrow_size) 
arrows(0,0,pca1[2]* arrow_size,pca2[2]* arrow_size)
arrows(0,0,pca1[3]* arrow_size,pca2[3]* arrow_size)
arrows(0,0,pca1[4]* arrow_size,pca2[4]* arrow_size)
text(pca1[4]* arrow_size,pca2[4]* arrow_size, row.names(pca$rotation)[4])
arrows(0,0,pca1[5]* arrow_size,pca2[5]* arrow_size)

 

 

 

      completion_male  completion_female  income_per_person  employment  life_expectancy
PC1       -0.50283735        -0.50886538         -0.4686439   0.2264034       -0.4661813
PC2       -0.01732469        -0.01884488           0.232003   0.9367102        0.2609469

 

image

 

7)    Extra points on Interpreting the results:

The explained variance on each principal component can be very useful in the interpretation. Since there is a good amount of variance explained by both the first and second principal components, we can take a look at both of these components together when trying to see if there are features that go well with one another. If the first component had a higher amount of explained variance and the second component a smaller amount, then we might well believe that there is only one major signal in the data that is covered by all of the features except for the fourth (employment). With the data as it is, the second component is important enough that we might be better-suited to consider the first and second features as one pair of related features, and the third and fifth as a second pair.

If we see this behaviour in our principal components, then we can actually go back and create hybrid features and then substitute them in for our original features in the original feature space. When it comes to creating composite features based on the results of PCA, it certainly helps to have domain knowledge. And if you don’t have that knowledge, it is a good opportunity to do some additional exploration and research to see if you can understand the underlying relationship between features. If you have documentation for your dataset, look there; performing additional visualizations can also be a big help to understanding the data. Data analysis is a process where you do need to move between steps in an iterative fashion, from wrangling to exploration to modelling.

As for selecting a good number of principal components to reduce down to, you can look at the amount of variance explained by each component. If you start out with a large number of components, then you can see the importance of each component on explaining variation in the original data. Once the values get very small, you can make an assumption that the components are beginning to fit to the random noise in the data rather than the true signal. If you look at the size of the explained variance values or the trend in the cumulative sum of explained variance coefficients (visualizing these in plots will be quite useful), this can help you make a judgement on how many components you want to include in your algorithm.

Interpretation of the output of principal component analysis is quite tricky since there are a lot of moving parts and principal components may not always be easily interpretable. I don’t think there’s any hard, deterministic rules to follow; one of the most important skills that you can have is to be able to synthesize multiple bits of information together and make your own conclusions.

However, it’s also important to realize that, in practice, it may not be possible to see such clean correlations and relationships between features. If we have a lot of features, it may be difficult to find a coherent interpretation for each principal component. This is one of the downsides of PCA: it may be a good dimensionality reduction technique, but it certainly has the possibility of returning uninterpretable components.

Posted in PCA, Statistics | Leave a Comment »

How to measure table space on disk in ParAccel

Posted by Diego on July 31, 2015

Pay special attention to the value 262144, which represents the block size in bytes.
The default is 1 MB, but if the block size is different (in my case, for example, it is 256 KB = 262144 bytes), you have to multiply the number of blocks by the block size in bytes. For example, a table with 4,096 blocks of 256 KB occupies 4,096 * 262144 / (1024 * 1024) = 1,024 MB.

 

select 
c.oid as "Table ID", 
s.oid as "Schema ID",
trim(c.relname) as "Name", 
trim(s.nspname) as "Schema", 
trim(u.usename) as "Owner", 
coalesce(c.reltuples, 0)::bigint as "Rows", 
coalesce(t.blocks * 262144, 0)::bigint/(1024*1024) as "Total MBytes", 
coalesce(t.blocks, 0) as "Total Blocks",
c.relnatts as "Columns", 
current_database() as db 
from 	pg_catalog.pg_class c 
		join pg_catalog.pg_namespace s on s.oid = c.relnamespace 
		join pg_catalog.pg_user u on u.usesysid = c.relowner 
		left join ( 
				select p.id, p.db_id, 
				sum(coalesce(p.rows, 0)) as rows, 
				sum(coalesce(p.sorted_rows, 0)) as sorted_rows, 
				sum(coalesce(p.temp, 0)) as temp, 
				min(coalesce(p.rows, 0)) as min_rows, 
				max(coalesce(p.rows, 0)) as max_rows,
				sum(b.blocknum) as blocks 
				from (select distinct slice from pg_catalog.stv_blocklist) bl 
					 left join pg_catalog.stv_tbl_perm p on p.slice = bl.slice 
					 left join pg_catalog.pg_database d on d.datname = current_database() 
					 left join 
					 ( 
						 select bl.tbl, bl.slice, count(bl.blocknum) as blocknum 
						 from pg_catalog.stv_blocklist bl 
						 group by bl.tbl, bl.slice 
					 ) b on b.tbl = p.id and b.slice = p.slice group by p.id, p.db_id 
		) t on t.id = c.oid 
where c.relkind = 'r' 
and nspname not in ('pg_catalog', 'information_schema', 'pg_toast', 'pg_bitmapindex', 'pg_internal', 'pg_aoseg') 
order by c.relname, s.nspname
				                   

 

 

By Column:

“col” refers to a zero-based index for the column. Every table you create has three hidden columns appended to it: INSERT_XID, DELETE_XID, and ROW_ID (OID).

A table with 3 user-defined columns contains 6 actual columns, and the user-defined columns are internally numbered as 0, 1, and 2. The INSERT_XID, DELETE_XID, and ROW_ID columns are numbered 3, 4, and 5, respectively, in this example.

 

select col, count(*) as num_blocks, count(*) * 262144::bigint/(1024*1024) as "Total MBytes"
from stv_blocklist, stv_tbl_perm
where stv_blocklist.tbl = stv_tbl_perm.id
and stv_blocklist.slice = stv_tbl_perm.slice
and stv_tbl_perm.name = 'occurencecharts_results2'
and col IN (20, 21)
group by col
order by col;

Posted in Actian Matrix \ ParAccel, Big Data, I.T. | Leave a Comment »

Twitter Data Analysis

Posted by Diego on July 29, 2015

With the recent Windows 10 release, I decided to bring back to life some old Python code that I wrote over a year ago that analyses Twitter data. I was interested in seeing how people are reacting to the release.

The code is divided in two parts: “GetTwitterData.py”, which, as the name implies, fetches data from Twitter, and “Sentiment.py”, which analyses text for positive/negative sentiment.

 

1.Get the data

To access the Twitter API, it’s necessary to create an application at https://dev.twitter.com/apps. I won’t go through the details of how to do that because there are plenty of tutorials online.

The code to get the data has a lot of comments in it, so it shouldn’t be hard to understand. One thing to note is that there is some level of “hard coding” in it. For example, the “search term” (in this case, Microsoft) and the “mode” are set in the file itself.

Regarding the “mode”, there are 2 ways the function can work, “topN” or “live track”. The first one will output the latest N tweets on a particular subject while the second will output tweets as they are being generated in real-time (you will need to interrupt the program to make it stop).

Code:

 

#Used to fetch live stream data from twitter.
#To get credentials: "https://dev.twitter.com/apps"

import oauth2 as oauth
import urllib2 as urllib
import json
from pprint import pprint

api_key = "XXXXX"
api_secret = "XXXXX"
access_token_key = "XXXXX-XXXXX"
access_token_secret = "XXXXX"



_debug = 0

oauth_token    = oauth.Token(key=access_token_key, secret=access_token_secret)
oauth_consumer = oauth.Consumer(key=api_key, secret=api_secret)

signature_method_hmac_sha1 = oauth.SignatureMethod_HMAC_SHA1()

http_method = "GET"


http_handler  = urllib.HTTPHandler(debuglevel=_debug)
https_handler = urllib.HTTPSHandler(debuglevel=_debug)

def twitter_track(url, method, parameters):
    req = oauth.Request.from_consumer_and_token(oauth_consumer,
                                             token=oauth_token,
                                             http_method=http_method,
                                             http_url=url,
                                             parameters=parameters)

    req.sign_request(signature_method_hmac_sha1, oauth_consumer, oauth_token)
    headers = req.to_header()

    if http_method == "POST":
        encoded_post_data = req.to_postdata()
    else:
        encoded_post_data = None
    url = req.to_url()

    opener = urllib.OpenerDirector()
    opener.add_handler(http_handler)
    opener.add_handler(https_handler)
    response = opener.open(url, encoded_post_data)

    return response


def getData(mode, topn):
    parameters = []

    #returns an infinite stream of tweets, hence the need to ^C to break out of the for loop
    #use the first URL to get all sort of tweets
    if mode=='live track':
        #url = "https://stream.twitter.com/1/statuses/sample.json" 
        url = "https://stream.twitter.com/1.1/statuses/filter.json?track=Microsoft" #track one subject
        response = twitter_track(url, "GET", parameters)
        for line in response:
            text = line.strip()
            #line is a string so Im doing some very basic (and error prone) string manipulation - room for improvement here
            s= str.find(text,"text")
            e =str.find(text,"source")
            print text[s+7:e-3]
            print ""


    elif mode=="topN":#will return TOP N tweets on the subject
        tweet_count = '&count='+str(topn)    # tweets/page
        queryparams = '?q=Microsoft&lang=en'+tweet_count
        url = "https://api.twitter.com/1.1/search/tweets.json" + queryparams

        #Ignoring the "parameters" variable - quite easy to use the URL
        response = twitter_track(url, "GET", parameters)
        data = json.load(response)#contains all N tweets
        #pprint(data) # data is a dictionary
        for tweet in data["statuses"]:
            print tweet["text"]

#TO DO:
#search term is hardcoded - parametrize it
#clean unnecessary characters, ex: URLS are coming like: http:\/\/t.co\/RC1Z7IaMu5
if __name__ == '__main__':
    #Options:
    #live track: Track Function where all tweets or a single search criteria can be tracked in real-time
    #            Tweets do not repeat, second parameter ignored
    #topN: Displays last N tweets on the subject
  getData("live track",10)

 

Here is an example of the output:

image

And here is how I called the code to pipe the output to a text file – that will be used on the sentiment analysis:

image
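(Roughly, the call looked like the sketch below; microsoft.txt matches the default file name that Sentiment.py expects:)

python GetTwitterData.py > microsoft.txt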

 

 

2.Analyse it

The analyse function takes 2 arguments: the first is a path to a “sentiment” file and the second is the file we want to analyse. The sentiment file I’m using is called AFINN-111. AFINN is a list of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words were manually labelled by Finn Årup Nielsen in 2009-2011. The file is tab-separated and contains 2477 words. The code reads both files, creates a dictionary with word-value pairs from the AFINN file and loops through the lines of the tweet file, looking up each word in the dictionary just created and summing their values for the tweet (a very simplistic word-by-word approach).

 

import sys
import json
import operator

def main():

    if len(sys.argv) == 3:  # script name + sentiment file + tweet file
        sent_file = open(sys.argv[1])
        tweet_file = open(sys.argv[2])
    else:
        sent_file = open("AFINN-111.txt")
        tweet_file = open("microsoft.txt")


    #load dictionary
    scores = {}
    for line in sent_file:
        term, score  = line.split("\t")  # The file is tab-delimited.
        scores[term] = int(score)

    #missing_scores = {} #not using this at the moment
    ranked_tweets = {}

    for line in tweet_file:
        line= line.strip()
        if line=="":#ignore blank lines
            continue

        #print line
        tweet_score = 0
        try:
            words = line.split()

            for word in words:
                tweet_score += scores.get(word, 0)

            if tweet_score !=0:
                ranked_tweets[line] = tweet_score

        except KeyError:
            continue

    print "Number of tweets scored: "+str(len(ranked_tweets))
    d = dict((k, v) for k, v in ranked_tweets.items() if v > 0)
    print "    Positive Tweets: "+str(len(d))
    d = dict((k, v) for k, v in ranked_tweets.items() if v < 0)
    print "    Negative Tweets: "+str(len(d))

    print ""
    print ""

    print "Top 10 Best tweets: "
    for key, value in sorted(ranked_tweets.iteritems(), key=lambda (k,v): (v,k), reverse=True)[0:10]:
        print "   %s: %s" % (key, value)

    print " "
    print " "

    print "Top 10 Worst tweets: "
    for key, value in sorted(ranked_tweets.iteritems(), key=lambda (k,v): (v,k))[0:10]:
        print "   %s: %s" % (key, value)


    #Print all disctionary
    #for key, value in sorted_x.iteritems() :
        #print key, value


if __name__ == '__main__':
    main()

 

I left the program running for over an hour and collected around 20,000 tweets; of those, 3,492 were scored. One important thing to note (and I only realized this when I saw how low that number was) is that I forgot to fetch only English tweets in the “live track” option of GetTwitterData; therefore a lot of tweets were ignored at this step rather than at the first step, because the sentiment file is in English only, which explains the low classification rate.

The result is quite interesting: of the 3,492 tweets scored, 2,667 are positive and 825 are negative, which possibly indicates good acceptance from the community.

Below I print the 10 best and worst tweets (please mind the language on the negative ones):

 

toptweets

 

 Things to consider (improve):

·         I see a lot of re-tweets – maybe I shouldn’t consider them? Or if you are re-tweeting something that someone else liked\hated does it mean you like\hate it too?

·         Only consider tweets in English (this is more for the first part of the program)

·         Should I consider a different way to decide whether a tweet was scored or not? Tweet_score “zero” can either mean that no words were scored or that bad words cancelled good words. Is there a difference?

 

 

 

note: thanks to this post on how to post python code

Posted in Python, Twitter | Leave a Comment »