Um blog sobre nada

A collection of useless things that may turn out to be useful

AWS S3 – Copying objects between AWS accounts (TO and FROM)

Posted by Diego on December 21, 2020

There are two possible situations in which you might want to move S3 objects between different AWS accounts. You could be copying an object FROM a different AWS account into your account, or you could be copying an object that lives in your account TO a different AWS account. The two approaches are similar, but not identical.

OPTION 1 – Copy FROM another account
(you are on the destination account and want to copy the data from a source account)

First, add this POLICY to the source bucket:

  • DESTACCOUNT: the destination account ID
  • SOURCEACCOUNT: the source account ID (used in Option 2 below)
  • YOURROLE: the role on the destination account that performs the copy
  • SOURCEBUCKET: the name of the bucket where the data currently is
  • DESTINATIONBUCKET: the name of the bucket you want to copy the data to
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DelegateS3Access",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:sts::DESTACCOUNT:assumed-role/YOURROLE"
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::SOURCEBUCKET/*",
                "arn:aws:s3:::SOURCEBUCKET"
            ]
        }
    ]
}

Second, add this policy to the role that will perform the copy (YOURROLE):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::SOURCEBUCKET",
                "arn:aws:s3:::SOURCEBUCKET/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::DESTINATIONBUCKET",
                "arn:aws:s3:::DESTINATIONBUCKET/*"
            ]
        }
    ]
}
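
With both policies in place, the copy can be run from the destination account using credentials obtained for YOURROLE. A minimal boto3 sketch, where the bucket names are the placeholders defined above and the object key is purely illustrative:

import boto3

# Runs on the destination account under YOURROLE.
# SOURCEBUCKET / DESTINATIONBUCKET are the placeholders defined above;
# the object key is just an example.
s3 = boto3.resource('s3')
copy_source = {'Bucket': 'SOURCEBUCKET', 'Key': 'path/to/object'}
s3.Bucket('DESTINATIONBUCKET').copy(copy_source, 'path/to/object')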

OPTION 2 – Copy TO another account
(for example, a Lambda function copies data from the account it runs in to a different account)

First, add this policy to the destination bucket, on the destination account:

{
    "Version": "2012-10-17",
    "Id": "MyPolicyID",
    "Statement": [
        {
            "Sid": "mySid",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::SOURCEACCOUNT:root"
            },
            "Action": "s3:PutObject",
            "Resource": [
                "arn:aws:s3:::DESTINATIONBUCKET/*",
                "arn:aws:s3:::DESTINATIONBUCKET"
            ]
        }
    ]
}

Second, add the same policy shown in the second step of Option 1 to the role that will perform the copy (YOURROLE).

IMPORTANT: Object Ownership

If you are copying from account A TO account B (a Lambda running on A, for example), the objects in account B will be owned by the user that performed the copy on account A. That may (and almost certainly will) cause problems on account B, so make sure to add the “bucket-owner-full-control” ACL when copying the object. For example:

import boto3

s3 = boto3.resource('s3')

# Object to copy from the source bucket
copy_source = {
    'Bucket': 'sourceBucket',
    'Key': 'sourceKey'
}

# Grant the destination bucket's owner full control over the copied object
extra_args = {'ACL': 'bucket-owner-full-control'}
bucket = s3.Bucket('destBucket')
bucket.copy(copy_source, 'destKey', ExtraArgs=extra_args)

Posted in AWS, S3

How to Install Maven 

Posted by Diego on January 18, 2023

1. Install Latest JDK

For Ubuntu:

sudo apt install default-jdk -y

For Amazon Linux:

sudo yum install java-17-amazon-corretto-devel -y

For Red Hat Linux:

sudo yum install java-17-openjdk -y

Verify Java JDK installation using the following command.

java -version

2. Install Maven (Latest From Maven Repo)

Step 1: Go to the Maven Downloads page and download the latest package.

wget https://dlcdn.apache.org/maven/maven-3/3.8.6/binaries/apache-maven-3.8.6-bin.tar.gz

Step 2: Untar the mvn package to the /opt folder.

sudo tar xvf apache-maven-3.8.6-bin.tar.gz -C /opt

Step 3: Create a symbolic link to the maven folder. This way, when you have a new version of maven, you just have to update the symbolic link and the path variables remain the same.

sudo ln -s /opt/apache-maven-3.8.6 /opt/maven

3. Add Maven Folder To System PATH

To access the mvn command system-wide, you need to either set the M2_HOME environment variable or add /opt/maven/bin to the system PATH.

We will do both by adding a script to the /etc/profile.d folder, so that it gets sourced every time a shell starts and the mvn command is available system-wide.

Step 1: Create a script file named maven.sh in the profile.d folder.

sudo vi /etc/profile.d/maven.sh

Step 2: Add the following to the script and save the file.

export M2_HOME=/opt/maven
export PATH=${M2_HOME}/bin:${PATH}

Step 3: Add execute permission to the maven.sh script.

sudo chmod +x /etc/profile.d/maven.sh

Step 4: Source the script for changes to take immediate effect.

source /etc/profile.d/maven.sh

Step 5: Verify maven installation

mvn -version

Posted in I.T.

Glue Streaming – Error in SQL statement: AnalysisException: Table or view not found

Posted by Diego on December 2, 2022

How to fix the AnalysisException SQL error “Table or view not found”.

Problem:

On a Glue Spark streaming job, the code below fails with the error message above:

        data_frame.createOrReplaceTempView("footable")
        spark.sql("select * from footable").show()

Explanation:

A streaming query uses its own SparkSession, which is cloned from the SparkSession that started the query, and the DataFrame provided by foreachBatch is created from the streaming query’s SparkSession. Hence you cannot access its temp views from the original SparkSession.

Solution:

Create a Global temp view:

        data_frame.createOrReplaceGlobalTempView("footable")
        spark.sql("select * from global_temp.footable").show()

Posted in Uncategorized

AWS Athena – Visualising usage and cost

Posted by Diego on June 10, 2021

This is a placeholder for this article:

https://aws.amazon.com/blogs/big-data/auditing-inspecting-and-visualizing-amazon-athena-usage-and-cost/

It provides a fully working solution (a CloudFormation template) to analyze, amongst other things, how much data each query submitted to Athena has scanned.

Posted in Athena, AWS

AWS Lambda – pandas + pyarrow on a Layer

Posted by Diego on January 12, 2021

Here I show how to create an AWS Lambda layer with pandas and pyarrow so you can use it to convert CSV files to Parquet for testing purposes. I say “testing” because, in a production scenario, I do not recommend performing this operation on Lambda because of its memory limitations. Even if your CSV files are small, chances are that in the future they will grow and break your process. Lambda is, however, a good place for testing the process.

This process needs to be run on a UNIX-like system (Linux or macOS).

1. Create a working folder and install libraries

mkdir folder
cd folder
virtualenv v-env --python=/usr/bin/python3
source ./v-env/bin/activate
pip install pandas
pip install pyarrow
deactivate

2. Create a new folder with only what we need (also in the format expected by the layer)

  • Be careful with the version of Python (we only specified Python 3 above, so it will get the latest version) and check whether the packages were installed under lib or lib64
  • Also, if the file ends up being too big to upload via the CLI, you can use a tool called “cleanpy” to remove unnecessary Python files.
mkdir pandaslayer
cd pandaslayer
mkdir python
cd python
cp -r ../../../folder/v-env/lib/python3.8/site-packages/* .
cd ..
zip -r pandaslayer.zip python

3. Publish the layer

aws lambda publish-layer-version --layer-name MyLayer --zip-file fileb://pandaslayer.zip --compatible-runtimes python3.8
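
With the layer attached to a function, a minimal handler to test the CSV-to-Parquet conversion could look like the sketch below (the bucket and key names are made-up examples):

import boto3
import pandas as pd

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Illustrative bucket/key names, replace with your own
    s3.download_file('my-input-bucket', 'data.csv', '/tmp/data.csv')
    df = pd.read_csv('/tmp/data.csv')
    # pyarrow from the layer is used as the Parquet engine
    df.to_parquet('/tmp/data.parquet', engine='pyarrow')
    s3.upload_file('/tmp/data.parquet', 'my-output-bucket', 'data.parquet')
    return {'rows': len(df)}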

Posted in Uncategorized

Python – Integration with Google Sheets

Posted by Diego on August 24, 2020

Setting things up:

  1. Go to console.cloud.google.com/
  2. Create a new project
  3. Search for the Google Sheets API
  4. Click “Enable API”
    1. Do the same for the Google Drive API
  5. On the Google Sheets API -> create credentials
    1. Click on the “Service Account” link
      1. Add a name and keep clicking next (I haven’t selected any of the “optional” stuff)
  6. Click on the Service Account created
  7. Scroll down to “keys” -> create key
    1. Select JSON

Integration:

The credentials file has a “client_email” field. You need to share the spreadsheet you want to query with that email.

Python:

pip install gspread_dataframe gspread oauth2client

import gspread
from gspread_dataframe import (get_as_dataframe, set_with_dataframe)
from oauth2client.service_account import ServiceAccountCredentials
import pandas as pd

FILE_KEY = "YOURFILEKEY"
SHEET_NAME = "SheetName"

def _get_worksheet(key,  worksheet_name, creds) -> gspread.Worksheet:
    scope = ["https://spreadsheets.google.com/feeds",
             "https://www.googleapis.com/auth/drive"]
    credentials = ServiceAccountCredentials.from_json_keyfile_name(creds, scope)
    gc = gspread.authorize(credentials)
    wb = gc.open_by_key(key)
    sheet = wb.worksheet(worksheet_name)
    return sheet

def write(sheet: gspread.Worksheet, df: pd.DataFrame, **options) -> None:
    set_with_dataframe(sheet, df,
                     include_index=False,
                     resize=True,
                     **options)
    
def read(sheet: gspread.Worksheet, **options) -> pd.DataFrame:
    return get_as_dataframe(sheet,
                     evaluate_formulas=True,
                     **options)


sh = _get_worksheet(FILE_KEY, SHEET_NAME, "./credentials.json")
df = read(sh)

FYI: the file key is the sheet’s unique identifier, i.e. the part of the URL after “https://docs.google.com/spreadsheets/d/”.
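
Writing a DataFrame back to the same worksheet works the same way; the sample DataFrame below is just an illustration:

new_df = pd.DataFrame({'col_a': [1, 2], 'col_b': ['x', 'y']})
write(sh, new_df)  # resizes the sheet and writes the data without the index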

Posted in Python

AWS S3 – ListObjectsV2 operation: Access Denied

Posted by Diego on May 27, 2020

Every now and then I fall for this. Upon getting a:

botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

error, my first course of action is to try to add that exact permission to whatever role needs it (something like s3:ListObjectsV2).

That won’t work (despite the fact that the CloudFormation template will deploy just fine!). The correct permission to grant is ‘s3:ListBucket’ (note that it applies to the bucket ARN, arn:aws:s3:::BUCKET, not to arn:aws:s3:::BUCKET/*), which is necessary for the following operations:

  • GetObject
  • PutObject
  • CreateBucket

Posted in AWS, S3

AWS Lambda and VPCs – changes Sep 2019

Posted by Diego on May 12, 2020

This is just a placeholder for an excellent AWS article that explains changes in how Lambdas are deployed on VPCs, and why Lambdas now take longer to be deployed on a VPC but, as a result, have a faster start time.

https://aws.amazon.com/blogs/compute/announcing-improved-vpc-networking-for-aws-lambda-functions/

Posted in Uncategorized

Changing Windows password without CTRL + ALT + DEL

Posted by Diego on April 29, 2020

We’ve all been there: trying to change your password while on an RDP session. CTRL + ALT + DEL doesn’t work, the virtual keyboard doesn’t work, and the “Accounts” settings tell you to issue a CTRL + ALT + DEL… so PowerShell to the rescue:

$AccountName = '<Your Account>'
 
$current = Read-Host -asSecureString "Enter the current password"
$newpw = Read-Host -asSecureString "Enter the new password"
 
Set-AdAccountPassword -Identity $AccountName -OldPassword $current -NewPassword $newpw

Posted in DevOps

Docker + Windows – error: "exited with code 127"

Posted by Diego on March 25, 2020

Problem: on Windows, when cloning a repo from Git that contains bash files, you may get the above error from one of the bash files.

Explanation: it turns out that Windows and Git have a problem with “end of line” characters, as a cloned repository will contain CRLF line breaks instead of LF (more info here).

Solution: clone your repo with the “--config core.autocrlf=input” option, e.g. git clone --config core.autocrlf=input <repo-url>.

Posted in DevOps