This is a placeholder for this article:
a fully working solution (CFN template) to analyze, amongst other things, how much data each query submitted to Athena scanned.
Posted by Diego on June 10, 2021
Posted in Athena, AWS
Posted by Diego on December 21, 2020
There are two possible situations where you’d want to move S3 objects between different AWS accounts. You could be trying to copy an object FROM a different AWS account to your account, or you could be trying to copy an object that resides in your account TO a different AWS account. In both cases the approach is similar, but slightly different.
OPTION 1 – Copy FROM another account
(you are on the destination account and want to copy the data from a source account)
First, add this POLICY to the source bucket:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DelegateS3Access",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:sts::DESTACCOUNT:assumed-role/YOURROLE"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::SOURCEBUCKET/*",
        "arn:aws:s3:::SOURCEBUCKET"
      ]
    }
  ]
}
Second, add this policy to the role that will perform the copy (YOURROLE):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::SOURCEBUCKET",
        "arn:aws:s3:::SOURCEBUCKET/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:PutObject",
        "s3:PutObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::DESTINATIONBUCKET",
        "arn:aws:s3:::DESTINATIONBUCKET/*"
      ]
    }
  ]
}
OPTION 2 – Copy TO another account
(for example, a lambda function copies the data from the account it runs to a different account)
First, add this policy to the destination bucket on the destination account:
{
  "Version": "2012-10-17",
  "Id": "MyPolicyID",
  "Statement": [
    {
      "Sid": "mySid",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::SOURCEACCOUNT:root"
      },
      "Action": "s3:PutObject",
      "Resource": [
        "arn:aws:s3:::DESTINATIONBUCKET/*",
        "arn:aws:s3:::DESTINATIONBUCKET"
      ]
    }
  ]
}
Second, add the exact same policy shown in the “Second” step of Option 1 to the role that will perform the copy (YOURROLE).
IMPORTANT: Object Ownership
If you are copying from account A TO account B (a lambda running on A, for example), the objects on account B will be owned by the user that performed the copy on account A. That may (definitely will) cause problems on account B, so make sure to add the “bucket-owner-full-control” ACL when copying the object. For example:
import boto3

s3 = boto3.resource('s3')
copy_source = {
    'Bucket': 'sourceBucket',
    'Key': 'sourceKey'
}
bucket = s3.Bucket('destBucket')
# grant the destination bucket's owner full control over the copied object
extra_args = {'ACL': 'bucket-owner-full-control'}
bucket.copy(copy_source, 'destKey', ExtraArgs=extra_args)
Posted in AWS, S3
Posted by Diego on May 27, 2020
Every now and then I fall for this. Upon getting a:
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
error, my first course of action is to try to add that exact permission to whatever role I need (something like: s3:ListObjectsV2)
That won’t work (despite the fact that the CloudFormation template will run just fine!!). The correct permission to grant is ‘s3:ListBucket‘, which is necessary for the following operations:
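For reference, a minimal statement granting that permission could look like the sketch below (the bucket name is a placeholder). Note that s3:ListBucket must be granted on the bucket ARN itself, not on bucket/*:

```json
{
  "Effect": "Allow",
  "Action": "s3:ListBucket",
  "Resource": "arn:aws:s3:::MYBUCKET"
}
```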
Posted in AWS, S3
Posted by Diego on February 26, 2020
CodePipeline notifications can be created very easily through the AWS console, but in the spirit of 100% automation, I was trying to create them using CloudFormation. Very quickly I found out that this is “AWS::CodeStarNotifications::NotificationRule” functionality, a new..ish AWS service that is not very well documented.
The first thing you’ll notice is that you need to specify which “EventTypeIds” you want the pipeline to notify on. These are the events you can select here:
At the time of writing, these events were not documented anywhere. I went through the CloudFormation documentation and could see that the allowed values are not listed, nor is any reference provided for where they can be found.
So I reached out to AWS and their support acknowledged that this info was missing from their docs and provided me with the list below (which will probably be reflected in their docs pretty soon):
Sample template:
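As a sketch, a notification rule could look like this (the rule name and pipeline name are placeholders, PipelineNotificationTopic is assumed to be an SNS topic defined elsewhere in the template, and the two EventTypeIds shown are just examples from that list):

```yaml
PipelineNotificationRule:
  Type: AWS::CodeStarNotifications::NotificationRule
  Properties:
    Name: my-pipeline-notifications
    DetailType: FULL
    Resource: !Sub arn:aws:codepipeline:${AWS::Region}:${AWS::AccountId}:MyPipeline
    EventTypeIds:
      - codepipeline-pipeline-pipeline-execution-succeeded
      - codepipeline-pipeline-pipeline-execution-failed
    Targets:
      - TargetType: SNS
        TargetAddress: !Ref PipelineNotificationTopic
```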
Posted in AWS, CodePipeline
Posted by Diego on February 21, 2020
Objective: Create a SNS topic subscription to a lambda function (when something publishes to the topic, we want to run a lambda function)
If we do it manually through the console, it works just fine.
After creating the subscription, AWS will automatically add a trigger to the lambda function, which will allow the topic to invoke the lambda
Here, for example, on the “LambdaTest” topic, I created a subscription to the “test” lambda, and this is what I see on the lambda:
Problem: that will not happen if we create the topic + subscription using CloudFormation as AWS won’t create the trigger we see on the left.
“Common sense” would say that you can create the lambda and the topic in CloudFormation (something like this):
and AWS will create the trigger automatically as well (like it does from the console) – but that is not the case.
You need to create the trigger yourself as well – which kind of creates a “chicken and egg” situation, because the topic needs to point to the lambda (as a subscription) and the lambda’s trigger (EventSource) needs to point to the topic.
Fortunately (or not – who knows?) from CloudFormation you can create SNS subscriptions to lambdas that don’t yet exist (only the console enforces an existing lambda, by throwing a “ResourceNotFoundException” error message).
Alternatively, you can add an “AWS::Lambda::Permission” to your function, which allows the SNS topic to call the Lambda function. These are called “resource-based policies” and they enable you to grant usage permission to other accounts on a per-resource basis. You can also use a resource-based policy to allow an AWS service to invoke your function.
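Putting it together, a minimal sketch would pair the subscription with the permission (assuming a topic MyTopic and a function MyLambda defined elsewhere in the same template):

```yaml
MyTopicSubscription:
  Type: AWS::SNS::Subscription
  Properties:
    Protocol: lambda
    TopicArn: !Ref MyTopic
    Endpoint: !GetAtt MyLambda.Arn

MyLambdaInvokePermission:
  Type: AWS::Lambda::Permission
  Properties:
    FunctionName: !GetAtt MyLambda.Arn
    Action: lambda:InvokeFunction
    Principal: sns.amazonaws.com
    SourceArn: !Ref MyTopic
```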
Posted in AWS, SNS
Posted by Diego on November 20, 2019
AWS recently released the functionality of setting up notifications for CodePipeline using SNS.
“You can now receive notifications about events in repositories, build projects, deployments, and pipelines when you use AWS CodeCommit, AWS CodeBuild, AWS CodeDeploy, and/or AWS CodePipeline. Notifications will come in the form of Amazon SNS notifications. Each notification will include a status message as well as a link to the resources whose event generated that notification.”
When I tested the functionality the first time, I created the SNS topic using the console (during the Notification Rule creation) and everything worked as expected.
After the test, I decided to create the resources (especially the SNS topic) using CloudFormation, and I noticed that the notifications weren’t being published to the topic anymore.
After some research I found this on the AWS documentation:
“If you want to use an existing Amazon SNS topic instead of creating a new one, in Targets, choose its ARN. Make sure the topic has the appropriate access policy,….”
And indeed I realised that, when the topic was being created by the console, it added permission to “codestar” to publish to the topic…something that I never imagined necessary, because I didn’t know codestar was part of the equation.
In CloudFormation terms, what I needed to do was something like this:
FYI: the __default_statement_ID Sid is created automatically by CloudFormation if you don’t specify a “TopicPolicy”. Since we are adding the “codestar” permission, we need to add the default statement back ourselves (if, of course, you actually need those permissions).
PipelineNotificationTopic:
  Type: AWS::SNS::Topic
  Properties:
    DisplayName: MyTopicDisplayName
    TopicName: MyTopicName

PipelineNotificationTopicPolicy:
  Type: AWS::SNS::TopicPolicy
  Properties:
    PolicyDocument:
      Version: '2008-10-17'
      Statement:
        - Sid: CodeNotification_publish
          Effect: Allow
          Principal:
            Service: codestar-notifications.amazonaws.com
          Action: SNS:Publish
          Resource: !Ref PipelineNotificationTopic
        - Sid: __default_statement_ID
          Effect: Allow
          Principal:
            AWS: "*"
          Action:
            - SNS:GetTopicAttributes
            - SNS:SetTopicAttributes
            - SNS:AddPermission
            - SNS:RemovePermission
            - SNS:DeleteTopic
            - SNS:Subscribe
            - SNS:ListSubscriptionsByTopic
            - SNS:Publish
            - SNS:Receive
          Resource: !Ref PipelineNotificationTopic
          Condition:
            StringEquals:
              AWS:SourceOwner: !Sub ${AWS::AccountId}
    Topics:
      - !Ref PipelineNotificationTopic
Posted in AWS, CodePipeline, DevOps, SNS
Posted by Diego on November 13, 2019
I was recently working on an “EMR Notebook” attached to a cluster and noticed some strange behaviour, as I was getting the infamous “An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied” error while trying to read data from S3 from a Python notebook.
So, I checked the notebook’s role (which is called EMR_Notebooks_DefaultRole and is the default role you are presented with when creating a notebook) and noticed it HAD the “AmazonS3FullAccess” policy attached to it.
To my further amusement, I realized that I DIDN’T receive that error when running similar code from a PySpark notebook, EVEN THOUGH the IAM role attached to the EMR cluster that the notebook was connected to DIDN’T have S3 permissions;
Very quickly I realised that:
So great, one problem “solved” – but I was still clueless why the NB didn’t have access to S3.
While debugging the logs associated with the boto3 call to S3, I came across an awkward response from the API call:
Response body:
<?xml version="1.0" encoding="UTF-8"?>
<Error>
  <Code>AuthorizationHeaderMalformed</Code>
  <Message>The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'eu-west-1'</Message>
  ...
</Error>
And I say “awkward” because all my resources were created on eu-west-1….so why was it using us-east-1?
So, I opened a terminal (on the notebook – not on EMR) and checked that the role it ACTUALLY uses when submitting API calls is called “prod-EditorInstanceRole”.
Upon further research, I’ve learned that this is a role maintained by AWS and it does not have permission to the resources in your account, because “EMR Notebook” is a managed service designed to grant only limited features and permissions, and it cannot assume a customer role to access S3 resources; 😐
The ”us-east-1” in the message, I can only assume, is happening because that’s where the underlying EC2 instances are running (?)
The role EMR_Notebooks_DefaultRole is the notebook service role which is used by the notebook service to manage the AWS resources, for example provisioning EMR cluster, loading and saving notebooks to s3. This role is not assumed to run command line or code on the notebook instance.
I guess that, if someone needs to use their IAM roles to manage resources from the notebook locally, they would need to launch an EMR cluster with JupyterHub.
In summary:
Posted in AWS, EMR
Posted by Diego on November 11, 2019
This is very useful if you don’t want to write a template from scratch
Posted in AWS, CodePipeline, DevOps
Posted by Diego on November 7, 2019
This is particularly useful if you want to:
SELECT proname,
       n.nspname||'.'||p.proname||'('||pg_catalog.oidvectortypes(p.proargtypes)||')' AS signature,
       prosrc AS body
FROM pg_catalog.pg_namespace n
JOIN pg_catalog.pg_proc p ON pronamespace = n.oid
Posted in AWS, Redshift
Posted by Diego on October 18, 2019
The code below is a pretty straightforward example on how to create a Sklearn estimator and run a training job using SageMaker Python SDK.
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.estimator import SKLearn

sagemaker_session = sagemaker.Session()
role = get_execution_role()

script_path = 'myPythonFile.py'
source_dir = 'myFolder'

sklearn = SKLearn(
    entry_point=script_path,
    train_instance_type="ml.c4.xlarge",
    role=role,
    source_dir=source_dir,
    sagemaker_session=sagemaker_session,
    hyperparameters={'max_leaf_nodes': 10})

# train_input is the S3 input channel, defined elsewhere
training_job = sklearn.fit({'train': train_input}, job_name='myJob')
It works perfectly fine if the “entry_point” script and the “source_dir” directory are in the same location where the code is being executed (a SageMaker notebook, for example); however, if you try to use files located on S3, like so:
source_dir = "s3://mybucket/myfolder/"
you will get one of the errors below at the “Invoking user training script” step:
“HeadObject not found”
“UnexpectedStatusException: Error for Training job testSMSM10: Failed. Reason: AlgorithmError: framework error”
“tarfile.ReadError: empty file”
“tarfile.EmptyHeaderError: empty header”
That happens because, if referencing S3, source_dir must point to a .tar.gz file in an S3 bucket and not just the directory itself, which is not mentioned anywhere in the documentation:
source_dir Path (absolute or relative) to a directory with any other training source code dependencies including the entry point file. Structure within this directory will be preserved when training on SageMaker.
So, your source_dir should be:
source_dir = "s3://mybucket/myfolder/sklearn.tar.gz"
where sklearn.tar.gz contains all the required files.
Oh, and BTW, it must be a tar.gz file; otherwise you’ll get an error like “OSError: Not a gzip”
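To build that archive, something like the sketch below works (the folder and file names are hypothetical); after creating it, upload the file to S3 and point source_dir at its full S3 path:

```python
import os
import tarfile

def make_source_archive(src_dir, archive_path="sklearn.tar.gz"):
    """Package the training scripts into the .tar.gz that source_dir expects."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in os.listdir(src_dir):
            # arcname=name keeps files at the root of the archive,
            # preserving the flat structure that gets unpacked for training
            tar.add(os.path.join(src_dir, name), arcname=name)
    return archive_path
```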
Posted in AWS, Machine Learning, SageMaker