
Thursday 16 December 2021

12 Factor App Principles

 

The Twelve-Factor App is a methodology for building software-as-a-service applications. These best practices are designed to enable applications to be built with portability and resilience when deployed to the web.

1.      Code base

Use version control and keep one repository per application; shared code can be pulled in as a Git submodule or a Maven module.

2.      Dependencies:

Don't check JARs into Git. Declare dependencies explicitly and resolve them through Maven from a repository manager such as Artifactory or GCP Artifact Registry.

3.      Config:

Configuration should be strictly separated from code. Anything that varies from environment to environment must be moved out into configuration and managed via environment variables.

-        Database connections and credentials, system integration endpoints

-        Credentials to external services such as Amazon S3 or Twitter or any other external apps

-        Application-specific information like IP Addresses, ports, and hostnames, etc.

Principle: could you make your app open source at any time without compromising any credentials?
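
As a minimal sketch (the variable names below are hypothetical, not part of the original post), configuration can be read from environment variables at startup instead of being hard-coded:

public class AppConfig {
    // Hypothetical variable names; the values come from the deployment environment, not from code.
    public static final String DB_URL = requireEnv("DATABASE_URL");
    public static final String S3_ACCESS_KEY = requireEnv("S3_ACCESS_KEY");

    private static String requireEnv(String name) {
        String value = System.getenv(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException("Missing required environment variable: " + name);
        }
        return value;
    }
}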

4.      Backing service:

Databases, message brokers, and any other external systems the app communicates with are treated as backing services. Treat backing services as attached resources: for example, a messaging service or a PostgreSQL database referenced by a single URL that can be swapped out without any code change.

5.      Build, release, run:

Build stage: transforms the code into an executable bundle/build package.

Release stage: takes the build package from the build stage, combines it with the configuration of the deployment environment, and makes the application ready to run.

Run stage: runs the app in the execution environment.

Strictly separate these stages, so each can be triggered with a single command. You can use CI/CD tools to automate the build and deployment process, and Docker images make it easy to keep the build, release, and run stages cleanly separated.

6.      Processes:

Execute the app as one or more stateless processes

As per 12-factor principles, the application should not keep data in process memory; it must be persisted to a backing store and read from there. Where state is concerned, your application should store it in a database instead of in the memory of the process.

Avoid sticky sessions; using them is a violation of 12-factor app principles. If you need to store session information, you can choose Redis, Memcached, or any other cache provider based on your requirements.
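
For illustration only, here is a rough sketch of keeping session data in Redis via the Jedis client rather than in process memory (the host, key format, and expiry below are assumptions, not from the original post):

import redis.clients.jedis.Jedis;

public class SessionStore {
    private final Jedis jedis = new Jedis("localhost", 6379); // host/port are placeholders

    // Store session data externally with a 30-minute expiry instead of in process memory.
    public void save(String sessionId, String payload) {
        jedis.setex("session:" + sessionId, 1800, payload);
    }

    public String load(String sessionId) {
        return jedis.get("session:" + sessionId);
    }
}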

7.      Port binding :

Export services via port binding. The web app exports HTTP as a service by binding to a port and listening for requests coming in on that port. Spring Boot is one example of this: by default it ships with an embedded Tomcat, Jetty, or Undertow.
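
A minimal sketch of a self-contained Spring Boot app that binds its own port (the class name is arbitrary; the port can be overridden via the server.port property):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class DemoApplication {
    public static void main(String[] args) {
        // Starts the embedded server (Tomcat by default) and binds it to port 8080,
        // unless overridden, e.g. with -Dserver.port=9090 or the SERVER_PORT environment variable.
        SpringApplication.run(DemoApplication.class, args);
    }
}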

8.      Concurrency:

By adopting containerization, applications can be scaled horizontally as demand requires.

9.      Disposability

Maximize robustness with fast startup and graceful shutdown. Docker containers can be started or stopped instantly. Storing request, state, or session data in queues or other backing services ensures that a request is handled seamlessly in the event of a container crash.
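
As an illustrative sketch (not from the original post), a JVM shutdown hook can be used to release resources gracefully when the container receives SIGTERM:

public class Worker {
    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            // Runs when the process receives SIGTERM (e.g. on `docker stop`):
            // finish in-flight work, close connections, flush buffers.
            System.out.println("Shutting down gracefully...");
        }));

        // ... main processing loop would go here ...
    }
}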

10.  Dev/prod parity

Keep development, staging, and production as similar as possible. This reduces the risk of bugs that only show up in a specific environment.

11.  Logs

Treat logs as event streams

Observability is a first-class citizen. It can be achieved using APM tools (New Relic and similar) or log-aggregation tools such as the ELK stack and Splunk.

12.  Admin processes

Run admin/management tasks as one-off processes. Any needed admin tasks should be kept in source control and packaged with the application.

The twelve-factor principles advocate keeping such administrative tasks as part of the application codebase in the repository. That way, one-off scripts follow the same process defined for the rest of your codebase.

Ensure one-off scripts are automated so that you don't need to worry about executing them manually before releasing the build. The twelve-factor principles also suggest using the execution environment's built-in tooling to run those scripts on production servers.
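
For example, a one-off admin task can be shipped as its own entry point in the same codebase and run once in the execution environment (the class name and the DATABASE_URL variable below are made up for illustration):

public class MigrateUsersTask {
    public static void main(String[] args) {
        // One-off process: runs, performs its task, and exits.
        // It reads the same environment-based configuration as the main app,
        // so it runs against the same backing services.
        System.out.println("Running one-off migration against " + System.getenv("DATABASE_URL"));
        // ... perform the migration here, then exit ...
    }
}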

Ref: https://12factor.net/

Friday 18 June 2021

Continuous Integration Vs Continuous Delivery Vs Continuous Deployment



Continuous integration: developers merge their work frequently and every change is automatically built and tested. Continuous delivery extends this so that the built, tested code is always ready to be released, with the actual release triggered manually. Continuous deployment goes one step further and automatically deploys every change that passes the automated tests.


Friday 30 April 2021

Publish maven project to google cloud (GCP) Artifact Registry

  • Enable the Artifact Registry API.
  • Install and initialize the Cloud SDK.
  • Create the repository. You might not see the Maven format option; it was still in alpha at the time and disabled by default, so you have to fill in a form to enable it, and approval can take some time. Let's assume your repository name is "quickstart-maven-repo" and the location you selected is "us-central1".

  • Now go to your command prompt and log in with gcloud
$gcloud auth login
$gcloud config set project <myProject>
  • Set the default repository by running:
$gcloud config set artifacts/repository quickstart-maven-repo
  • Set the default location:
$gcloud config set artifacts/location us-central1

  • Create a service account from the Google Cloud console, then grant it write access to the repository with the command below

$gcloud artifacts repositories add-iam-policy-binding quickstart-maven-repo --location=us-central1 --member='serviceAccount:ACCOUNT' --role='roles/artifactregistry.writer'

Where ACCOUNT is the ID of your service account in the format USERNAME@PROJECT-ID.iam.gserviceaccount.com

  • Download the service account key

 $gcloud iam service-accounts keys create mykey.json --iam-account=USERNAME@PROJECT-ID.iam.gserviceaccount.com

$export GOOGLE_APPLICATION_CREDENTIALS=mykey.json 
where mykey.json is the key file generated in the previous step.

  • Now it's time to configure the Maven project
  • Choose a Maven project that you want to use and go to the root directory of the project.
  • Run the following command to print the settings for the default quickstart-maven-repo repository.
$gcloud artifacts print-settings mvn

  • The output should look like below
<distributionManagement>
  <snapshotRepository>
    <id>artifact-registry</id>
    <url>artifactregistry://us-central1-maven.pkg.dev/PROJECT/quickstart-maven-repo</url>
  </snapshotRepository>
  <repository>
    <id>artifact-registry</id>
    <url>artifactregistry://us-central1-maven.pkg.dev/PROJECT/quickstart-maven-repo</url>
  </repository>
</distributionManagement>

<repositories>
  <repository>
    <id>artifact-registry</id>
    <url>artifactregistry://us-central1-maven.pkg.dev/PROJECT/quickstart-maven-repo</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>

<build>
  <extensions>
    <extension>
      <groupId>com.google.cloud.artifactregistry</groupId>
      <artifactId>artifactregistry-maven-wagon</artifactId>
      <version>2.1.1</version>
    </extension>
  </extensions>
</build>

Add the output to the pom.xml
  • Run the below command to publish to the repo
$mvn clean deploy
and see the magic: go to the Google Cloud console, open Artifact Registry, and you should see your published JAR
  • For a dependent project to access the published JAR, add the below settings to its pom.xml
<repositories>
  <repository>
    <id>artifact-registry</id>
    <url>artifactregistry://us-central1-maven.pkg.dev/PROJECT/quickstart-maven-repo</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>

<build>
  <extensions>
    <extension>
      <groupId>com.google.cloud.artifactregistry</groupId>
      <artifactId>artifactregistry-maven-wagon</artifactId>
      <version>2.1.1</version>
    </extension>
  </extensions>
</build>

and then run the command
$mvn clean compile

The project should compile and download all the dependencies.


Tuesday 9 February 2021

Mongodb replica with docker compose

Clone the repo to get the docker-compose.yml and related files

After cloning the repo you will see docker-compose-standalone.yml, docker-compose.yml, setup_replica.sh, and resource/mongod-keyfile, which are used in the commands below.

To run a standalone server:
$docker-compose -f docker-compose-standalone.yml up -d

To run replicated servers:

$chmod 400 resource/mongod-keyfile

$./setup_replica.sh

$docker exec -it mongodb1 bash

$mongo -u root -p admin

Enjoy!

Monday 1 February 2021

Kafka basics

 What is Kafka?

Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. (Wikipedia).

Apache Kafka is a publish-subscribe based durable messaging system. A messaging system sends messages between processes, applications, and servers.

Topics:

A topic is a category/feed name to which records are stored and published. A topic is a particular stream of data, similar to a table name in a database.

Partitions:

Kafka topics are divided into a number of partitions, which contain records in an unchangeable sequence. Each record in a partition is assigned and identified by its unique offset. A topic can have multiple partition logs, which allows multiple consumers to read from a topic in parallel. Each message in a partition gets an incremental id called the offset.
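
As an illustrative sketch (the topic name, partition count, replication factor, and broker address are assumptions, not from the original notes), the partition count is fixed when a topic is created, for example via the Java AdminClient:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions allow up to 3 consumers in a group to read in parallel;
            // replication factor 2 keeps a copy of each partition on a second broker.
            admin.createTopics(List.of(new NewTopic("orders", 3, (short) 2))).all().get();
        }
    }
}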

Offset:

  • Offsets are like indexes in an array.
  • Order is guaranteed only within a partition (not across partitions)
  • Data is kept only for a limited time (the default is one week)
  • Data is assigned randomly to a partition unless a key is provided

Broker & Cluster:

A cluster is a collection of brokers, and brokers are the Kafka servers. Every Kafka broker is also called a "bootstrap server": you only need to connect to one broker and you will be connected to the entire cluster.

Leader:

    • At any time, only one broker can be the leader for a given partition
    • Only that leader can receive and serve data for the partition
    • The other brokers will synchronize the data
    • Therefore each partition has one leader and multiple ISRs (in-sync replicas)


Replicas:

Replicas are nothing but backups of a partition. If the replication factor of a topic is set to 4, then Kafka will keep four identical copies of each partition and place them across the cluster to make them available for all its operations. Replicas are not used to read or write data; they exist to prevent data loss.


Producers:

Producers write data to topics.

Message Key:

Producers can choose to send a key with the message.
    • key=null: data is sent round robin (Broker0, then Broker1, then Broker2)
    • key!=null: all messages for that key will always go to the same partition
A key is sent when you need message ordering for a specific field.

Key Hashing:
    • By default the producer uses the "murmur2" algorithm
    • Formula: targetPartition = Utils.abs(Utils.murmur2(record.key())) % numPartitions
    • Adding/removing partitions on a topic will completely alter the formula
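
A minimal producer sketch along these lines (the topic name, key, and broker address are placeholders): every message sent with the same key is routed to the same partition.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("customer-42") is hashed with murmur2 to pick the partition,
            // so ordering is preserved for this customer's events.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));
            producer.flush();
        }
    }
}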

        Acknowledgement:
Producers can choose to receive acknowledgment of data writes.
    • acks = 0: the producer won't wait for acknowledgment (possible data loss)
    • acks = 1: the producer will wait for the leader's acknowledgment (limited data loss)
      • The leader's response is requested, but replication is not guaranteed
      • If the ack is not received, the producer may retry
      • If the leader broker goes offline before the replicas have replicated the data, the data is lost
    • acks = all: leader + replicas acknowledgment (no data loss)
      • acks=all must be used in conjunction with min.insync.replicas
      • min.insync.replicas can be set at the broker or topic level (the topic setting overrides the broker setting)
      • min.insync.replicas=2 implies that at least 2 brokers that are ISRs (including the leader) must respond that they have the data
      • That means with replication.factor=3, min.insync.replicas=2 and acks=all, you can only tolerate 1 broker going down; otherwise the producer will receive an exception on send
        enable.idempotence=true (producer level) + min.insync.replicas=2 (broker/topic level)
        implies acks=all, retries=MAX_INT, max.in.flight.requests.per.connection=5 (default)
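
As a rough sketch of the "safe producer" settings above (the broker address is a placeholder; whether these values fit depends on your latency requirements):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class SafeProducerConfig {
    static Properties safeProducerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for leader + in-sync replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // implies acks=all, retries=MAX_INT,
                                                                     // max.in.flight.requests.per.connection=5
        // Broker/topic side (not a producer property): min.insync.replicas=2.
        // With replication.factor=3 this tolerates one broker being down.
        return props;
    }
}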

    Compression:

    • Producers usually send text-based data, e.g. JSON, which is large in size
    • In this case, it is important to apply compression at the producer
    • Compression is enabled at the producer level and doesn't require any change at the broker or in the consumer
        Compression Type:
      • "compression.type" can be 'none' (default), 'gzip', 'lz4', 'snappy'
      • Compression is more effective on bigger batches of data
      • Always use compression if you have high throughput
      • Consider tweaking linger.ms and batch.size to get bigger batches and therefore more compression and higher throughput
    • By default, Kafka tries to send records as soon as possible
    • It will have up to 5 requests in flight, meaning up to 5 messages individually sent at the same time
    • After this, if more messages have to be sent while others are in flight, Kafka is smart and will start batching them while they wait, to send them all at once
            linger.ms:
            Number of milliseconds a producer is willing to wait before sending a batch out (default 0)
    • By introducing some lag (for example linger.ms=5) we increase the chances of messages being sent together in a batch
    • By introducing this small delay, we increase the throughput, compression, and efficiency of the producer
    • If a batch is full (batch.size) before the end of the linger.ms period, it will be sent to Kafka right away!
            batch.size:
            Maximum number of bytes that will be included in a batch. The default is 16KB.
    • Increasing the batch size to 32KB or 64KB can help increase the compression, throughput, and efficiency
    • Any message that is bigger than the batch size will not be batched
    • A batch is allocated per partition, so make sure you don't set it to a number that is too high, otherwise it will waste memory
    • You can monitor the average batch size metric using Kafka producer metrics
            Advantages of compression:
      • Much smaller producer request size
      • Lower latency
      • Better throughput
      • Messages stored on disk in the broker are smaller
            Disadvantages:
      • Producers must commit some CPU cycles to compression
      • Consumers must commit some CPU cycles to decompression
           Note: If the producer produces faster than the broker can take, the records will buffer in buffer.memory and drain back down when throughput to the broker recovers. max.block.ms=60000 is the time .send() will block before throwing an exception, i.e. the exception is thrown when:
      • The producer has filled up its buffer
      • The broker is not accepting any new data
      • 60 seconds have elapsed
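
And a sketch of the batching/compression settings just described (the values are illustrative only; measure against your own throughput and latency before adopting them):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class HighThroughputProducerConfig {
    static Properties highThroughputProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // placeholder
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");              // 'none', 'gzip' and 'lz4' are also valid
        props.put(ProducerConfig.LINGER_MS_CONFIG, "5");                          // wait up to 5 ms to fill a batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(32 * 1024)); // 32KB batches instead of the 16KB default
        return props;
    }
}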

Consumers:

Consumers read data from topics.
  • Kafka stores the offsets at which a consumer group has been reading
  • They are stored in a Kafka topic named __consumer_offsets
  • When a consumer in a group has processed data received from Kafka, it should be committing the offsets
  • If a consumer dies, it will be able to read back from where it left off

    Delivery Semantics:

    • At most once: offsets are committed as soon as the message batch is received. If the processing goes wrong, the message will be lost.
    • At least once (usually preferred): offsets are committed after the message is processed. If the processing goes wrong, the message will be read again. This can result in duplicate processing of messages, so make sure your processing is idempotent.
    • Exactly once: can be achieved for Kafka-to-Kafka workflows using the Kafka Streams API. For Kafka-to-external-system workflows, use an idempotent consumer.

        There are two ways to make consumer record processing idempotent (unique):
    1. Kafka generic id: build a unique id from the record coordinates, e.g. String id = record.topic()+"-"+record.partition()+"-"+record.offset();
    2. Application-supplied unique value: use a unique value supplied by the producer inside the record itself.
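
An illustrative consumer sketch (group id, topic, and broker address are placeholders) that derives the generic id from topic-partition-offset and commits offsets only after processing the batch, i.e. at-least-once behavior:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class IdempotentConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-app");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // we commit after processing
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Unique id from topic-partition-offset, usable as a de-duplication key downstream.
                    String id = record.topic() + "-" + record.partition() + "-" + record.offset();
                    System.out.println("processing " + id + ": " + record.value());
                }
                consumer.commitSync(); // at-least-once: commit only after the batch is processed
            }
        }
    }
}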

        Consumer offset commit strategy:
    • enable.auto.commit=true & synchronous processing of batches: offsets are committed automatically for you at a regular interval (auto.commit.interval.ms=5000 by default) every time you call .poll(). If you don't use synchronous processing, you end up with "at-most-once" behavior because offsets are committed before your data is processed.
    • enable.auto.commit=false & synchronous processing of batches: you control when you commit offsets and what the condition for committing them is.
        Consumer offset reset strategy:
    • auto.offset.reset=latest // will read from the end of the log
    • auto.offset.reset=earliest // will read from the start of the log
    • auto.offset.reset=none // will throw an exception if no offset is found
If a consumer hasn't read new data in 7 days, its offsets can be lost; this is controlled by the broker setting offsets.retention.minutes.
To replay data for a consumer group:
    • Take all the consumers of that group down
    • Use the kafka-consumer-groups command to set the offsets to what you want
    • Restart the consumers

        Poll Behavior:

                fetch.min.bytes (default 1):
      • Controls how much data you want to pull at least on each request
      • Helps improve throughput and decrease the number of requests
      • At the cost of latency
                max.poll.records (default 500):
      • Controls how many records to receive per poll request
      • Increase it if your messages are very small and you have a lot of available RAM
      • Good to monitor how many records are polled per request
                Considerations:
      • Set a proper data retention period & offset retention period
      • Ensure the auto offset reset behavior is the one you expect/want
      • Use the replay capability in case of unexpected behavior

Zookeeper: 

  • Manages brokers (keeps a list of them)
  • It helps in performing leader election for partitions.
  • It sends notifications to Kafka in case of changes (e.g. new topic, broker dies, broker comes up, topic deleted, etc.)
  • Kafka cannot run without Zookeeper.
  • By design it operates with an odd number of servers.
  • It has a leader (the leader handles writes from the brokers); the rest of the servers are followers (they handle reads).
  • Zookeeper does not store consumer offsets (with Kafka > v0.10); they are stored in Kafka itself.

Kafka Guarantees:

  • Messages are appended to a topic-partition in the order they are sent.
  • Consumers read messages in the order stored in a topic-partition.
  • With a replication factor of N, producers and consumers can tolerate up to N-1 brokers being down.
  • As long as the number of partitions remains constant for a topic, the same key will always go to the same partition.



Monday 11 January 2021

Linux Awk scripting cheatsheet

 What is awk? 

It’s a full scripting language, as well as a complete text manipulation toolkit for the command line.

Awk is used to transform data files and produce formatted reports.

The way it works:
  • Scans a file line by line
  • Splits each input line into fields
  • Compares input lines/fields to patterns
  • Performs actions on matched lines
In the terminal, if you type awk and hit enter, you should see the below output, which shows the parameters it accepts and the format of the command.

/$ awk
Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:          GNU long options: (standard)
        -f progfile             --file=progfile
        -F fs                   --field-separator=fs
        -v var=val              --assign=var=val
Short options:          GNU long options: (extensions)
        -b                      --characters-as-bytes
        -c                      --traditional
        -C                      --copyright
        -d[file]                --dump-variables[=file]
        -e 'program-text'       --source='program-text'
        -E file                 --exec=file
        -g                      --gen-pot
        -h                      --help
        -L [fatal]              --lint[=fatal]
        -n                      --non-decimal-data
        -N                      --use-lc-numeric
        -O                      --optimize
        -p[file]                --profile[=file]
        -P                      --posix
        -r                      --re-interval
        -S                      --sandbox
        -t                      --lint-old
        -V                      --version

To report bugs, see node `Bugs' in `gawk.info', which is
section `Reporting Problems and Bugs' in the printed version.

gawk is a pattern scanning and processing language.
By default it reads standard input and writes standard output.

Examples:
        gawk '{ sum += $1 }; END { print sum }' file
        gawk -F: '{ print $1 }' /etc/passwd

Create a file in any directory you choose with the following contents
A,AB,ABC,ABCD
B,BA,CBA,C200
C,AC,ACB,100b
D,CD,BCD,98
F,GH,ABC,XYZ,LF

$awk -F, '{ print }' file // -F specifies the field separator; here the separator is ,
A,AB,ABC,ABCD
B,BA,CBA,C200
C,AC,ACB,100b
D,CD,BCD,98
F,GH,ABC,XYZ,LF

$awk -F',' '{ print $1}' file
A
B
C
D
F

$0: Represents the entire line of text.
$1: Represents the first field.
$2: Represents the second field.
$7: Represents the seventh field.
$45: Represents the 45th field.

$awk -F',' '{ print $1, $3}' file
A  ABC
B  CBA
C  ACB
D  BCD
F  ABC

The OFS (output field separator) variable puts a separator between output fields
$awk -F','  'OFS="/" { print $1, $3}' file
A/ ABC
B/ CBA
C/ ACB
D/ BCD
F/ ABC

Replacing all the values of column 2
$awk -F',' '{$2="1";print }' file
A 1  ABC  ABCD
B 1  CBA  C200
C 1  ACB  100b
D 1  BCD  98
F 1  ABC  XYZ  LF

Replacing all the values of column 2 and putting quotes around them
$awk -F, '{$2="\"1\"";print }' file
A "1"  ABC  ABCD
B "1"  CBA  C200
C "1"  ACB  100b
D "1"  BCD  98
F "1"  ABC  XYZ  LF

Number of fields (NF) per row after splitting by ,
$awk -F, '{ print NF }' file
4
4
4
4
5

A BEGIN rule is executed once before any text processing starts. In fact, it’s executed before awk even reads any text. An END rule is executed after all processing has completed. You can have multiple BEGIN and END rules, and they’ll execute in order.
$awk  -F',' 'BEGIN {print "Hello world"} { print $0}' file
Hello world
A,AB,ABC,ABCD
B,BA,CBA,C200
C,AC,ACB,100b
D,CD,BCD,98
F,GH,ABC,XYZ,LF


$awk 'END { print NR } { print }' file
A,AB,ABC,ABCD
B,BA,CBA,C200
C,AC,ACB,100b
D,CD,BCD,98
F,GH,ABC,XYZ,LF
5

To print each line along with its row number (NR)
$awk -F, '{ print NR ", " $0 }' file
1,A,AB,ABC,ABCD
2,B,BA,CBA,C200
3,C,AC,ACB,100b
4,D,CD,BCD,98
5,F,GH,ABC,XYZ,LF

Conditions and regular expressions

$awk -F, '$4 > 90 { print }' file
D,CD,BCD,98

$awk -F, '$3 ~ /A/ { print $0 }' file
A,AB,ABC,ABCD
B,BA,CBA,C200
C,AC,ACB,100b
F,GH,ABC,XYZ,LF

$awk -F, '$3 ~ /^A/ { print $0 }' file
A,AB,ABC,ABCD
C,AC,ACB,100b
F,GH,ABC,XYZ,LF

for loops in awk:
$awk 'BEGIN { for(i=1;i<=6;i++) print "square of", i, "is",i*i; }'
square of 1 is 1
square of 2 is 4
square of 3 is 9
square of 4 is 16
square of 5 is 25
square of 6 is 36

$awk -F, 'length($4) > 3' file
A,AB,ABC,ABCD
B,BA,CBA,C200
C,AC,ACB,100b

awk if conditions
$awk -F, '{ if($4 == "ABCD") print $0;}' file
A,AB,ABC,ABCD